A list of the functions created in this document.
ToDo:
Add: calculate_attrition_cost, count_to_pct, plot_attrition in scripts - assess_attrition
| Artifact | Thumbnail |
|---|---|
| Visualizations | |
| plot_attrition | |
| plot_ggpairs | |
| plot_hist | |
| plot_cor | |
| plot_leaderboard | |
| Precision_Recall_plot | |
| ROC_Plot | |
| H2O_Performance_Plot | |
| Plot_Lime_Feature_Explain | |
| Plot_Lime_Features_Explanations | |
| Confusion_Matrix_Plot | |
| Correlation_Analysis | |
| Functions | |
| get_cor | |
| extract_h2o_model_name_by_position | |
| Model_Metrics | |
| calculate_savings_by_thresh_2 | |
| Code Snippets | |
| Tidy_Eval | quos, enquo and !!! |
| rlang | rlang::sym turns a single character string into an expression. The expression is typically captured in enquo() or quos() to delay evaluation. |
| anonymous | If you choose not to give the function a name, you get an anonymous function. |
| fct_recode | mutate(OverTime = fct_recode(OverTime, "No" = "Yes")) |
| case_when | mutate(above_industry_avg = case_when(pct > 0.088 ~ "Yes", TRUE ~ "No")) |
| gather_example | gather takes multiple columns and collapses them into key-value pairs |
| fct_reorder2 | fct_reorder() is useful for 1d displays where the factor is mapped to position; fct_reorder2() for 2d displays where the factor is mapped to a non-position aesthetic |
| add_column | A convenient way to add one or more columns to an existing data frame. |
| partial | partial preloads a function with arguments that do not change. Part of purrr. |
| mutate + map | mutate and map are a powerful combination enabling row-wise iteration in a tidy way: mutate(metrics = map(model_id, get_model_performance_metrics, newdata_tbl)) |
| expand_limits | Change the range of axis values in ggplot: expand_limits(x = c(-1, 1)) |
| scale_x_continuous | scale_x_continuous(label = scales::dollar) |
| cross_df | cross_df from purrr is really helpful. From a list of variables, it produces all combinations of the list. |
| scale_fill_gradient2 | scale_fill_gradient2 uses 2 gradient transitions by specifying low, mid and high arguments along with a midpoint to specify the middle transition point. scale_fill_gradient offers only a low and high point; scale_fill_gradientn allows any number of gradients. |
| str_detect | filter(str_detect(terms, "YearsAtCompany")) |
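To make one of the snippets above concrete, here is a minimal sketch of purrr::partial. The pow/square names are illustrative only, not from the course scripts.

```r
library(purrr)

# partial() preloads a function with arguments that do not change
pow <- function(x, exp) x ^ exp
square <- partial(pow, exp = 2)

square(4)  # 16
```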
Differentiating between productive and non-productive employees is difficult and highly subjective. Some employees are in the top 20% (all stars), some are in the middle 60% (productive), and some are in the bottom 20%. The bottom 20% can be OK or poor depending on the overall level of talent in the organization: in a high performing organization, even the bottom 20% may still be good employees. However, most organizations do have bad employees.
Now, in terms of defining productivity: if managers rate employees on a 1 to 4 scale, with 1 being poor and 4 being the best, you can usually be OK in viewing the 4's as productive. They may not always be all stars, especially if 50% of people reviewed get 4's when by the bell curve only about 20% should. However, if a manager rates someone as a 1, there is definitely a problem. These people need immediate corrective action or need to be cut loose. This is good attrition, because the employee is a poor job/culture fit.
The analysis we are doing specifically looks at bad attrition: losing people that are 3's and 4's. I call them high performers, but in reality I should say productive employees. This group is probably driving 80% of the business results.
Non-profits will need to be handled based on what the business finds important. While maximizing profit may not be the goal, there are still goals. For example, one goal might be to increase high school graduation rates. This could be set up by showing that the percentage of student dropouts is much higher in the district of focus, and that increasing graduation rates is expected to reduce crime and increase workforce productivity. Crime reduction has a savings associated with it; workforce productivity has a financial benefit associated with it. These are possible ways to show financial value for a non-profit.
Non-Profit Cost Analysis: https://www.bridgespan.org/bridgespan/Images/articles/nonprofit-cost-analysis-toolkit/NonprofitCostsAnalysisToolkit.pdf
Non-Profit Strategic Planning: https://knowhownonprofit.org/organisation/strategy
The tools are quite similar to what you might find in a for-profit scenario. The maximization will be to maximize the financial delta between cost and benefit (to public) versus cost and revenue (for organization).
You take away our top 20 employees and we become a mediocre company. –Bill Gates
Employee Retention Now a Big Issue: Why the Tide has Turned
When a good employee quits, costs are incurred.
Not all attrition is bad
\[500+10,000+4,900+3,500=18,900\] (direct costs)
\[(250,000/240)*(40+(60*0.5))=72,917\] (lost productivity)
\[(80,000/240)*40=13,333\] (salary savings while the position is vacant)
\[Total\ Costs = 18,900+72,917-13,333=78,483\]
If 200 productive employees leave each year, then \(200 * 78,483 \approx 15,700,000\) is the financial cost of turnover.
Just a 10% reduction (using data science) saves the organization $1.6M/year!
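The same arithmetic in R, as a sanity check (the values come straight from the formulas above):

```r
direct_costs   <- 500 + 10000 + 4900 + 3500        # separation, vacancy, acquisition, placement
lost_prod      <- (250000 / 240) * (40 + 60 * 0.5) # lost productivity while open/onboarding
salary_savings <- (80000 / 240) * 40               # salary not paid while the position is open

total_cost <- direct_costs + lost_prod - salary_savings
round(total_cost, 2)     # 78483.33
round(200 * total_cost)  # 15696667 - roughly $15.7M for 200 leavers
```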
Business Science Problem Framework
BSPF is aligned with industry standard CRISP-DM
CRISP-DM is a high level industry-standard framework for data mining.
A high level plan that is adaptable but lacks the detail to drive data science projects
CRISP-DM Process Cycle
The goal is to make good decisions. Systematic decision making starts with measurement and analysis which leads to improvement.
The BSPF is based on Principles by Ray Dalio
BSPF has 7 steps:
Each part of the business has Objectives –> Machine –> Outcomes.
Our goal is to understand the machine using data
When you have exhausted all possibilities, remember this - you haven’t.
The BSPF includes the following steps of the CRISP-DM Framework:
Review business_understanding.R file in 01_Business Understanding folder
- Business Units: Department and Job Role
- Define Objectives: Retain high performers
- Assess Outcomes: TBD - feedback from the business
Known - 16.1% attrition in the current state - is this a bad thing?
- Investigate the Objectives: 16% attrition
- Synthesize Outcomes: high counts and high percentages (learned after hypothesizing drivers). See below under Understand the Drivers.
- Hypothesize Drivers: Job Role and Department
## # A tibble: 6 x 4
## # Groups: Department [3]
## Department Attrition n pct
## <chr> <chr> <int> <dbl>
## 1 Human Resources No 37 0.755
## 2 Human Resources Yes 12 0.245
## 3 Research & Development No 721 0.867
## 4 Research & Development Yes 111 0.133
## 5 Sales No 291 0.789
## 6 Sales Yes 78 0.211
Department and Job Roles are common cohorts to evaluate.
HR has 24% attrition, Sales 21% - might be something going on by department.
dept_job_role_tbl %>%
  group_by(Department, JobRole, Attrition) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(Department, JobRole) %>%
  mutate(pct = n/sum(n)) %>%
  ungroup() %>%
  filter(Attrition %in% c("Yes"))

## # A tibble: 10 x 5
## Department JobRole Attrition n pct
## <chr> <chr> <chr> <int> <dbl>
## 1 Human Resources Human Resources Yes 12 0.308
## 2 Research & Development Healthcare Representative Yes 8 0.0762
## 3 Research & Development Laboratory Technician Yes 49 0.219
## 4 Research & Development Manager Yes 2 0.0417
## 5 Research & Development Manufacturing Director Yes 7 0.0569
## 6 Research & Development Research Director Yes 2 0.0274
## 7 Research & Development Research Scientist Yes 43 0.166
## 8 Sales Manager Yes 2 0.0645
## 9 Sales Sales Executive Yes 50 0.183
## 10 Sales Sales Representative Yes 26 0.4
Note there are different roles within each department. Perhaps certain roles have greater attrition issues than others?
Note HR at 30.8%. Sales Executive is only 18.3%, but 50 people are leaving; Sales Representative may be a problem too.
Now in the BSPF - Measure the Drivers Step
This is an iterative task that is ongoing.
You may not have all the data initially. Further, you don’t want all the data since many of the features are a distraction prior to understanding if a problem exists. During the Business Understanding phase we need to isolate the critical features that relate to business outcomes and determine if a problem exists such that when solved it provides real value in terms of financial benefit.
There are 35 features to consider:
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EmployeeCount" "EmployeeNumber"
## [11] "EnvironmentSatisfaction" "Gender"
## [13] "HourlyRate" "JobInvolvement"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
Some features are related to one another.
It is always helpful to break down data collection activities into strategic areas.
It is often useful to review turnover statistics in your industry. We know ours is 16%, but we do not know if this is good or bad without [research](http://www.air-prehire.com/blog/industries-high-employee-turnover-rates/).
Perhaps it simply fails an internal KPI.
Via web searching, 8.8% is the average turnover for utility companies (useful for this data). (Banking is around 17.2%.)
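A minimal, self-contained illustration of the case_when pattern used throughout this section (the 0.088 baseline is the 8.8% utility-industry average above; the toy pct values are made up):

```r
library(dplyr)

res <- tibble(pct = c(0.40, 0.05, 0.21)) %>%
  mutate(above_industry_avg = case_when(
    pct > 0.088 ~ "Yes",  # above the 8.8% industry baseline
    TRUE        ~ "No"    # catch-all for everything else
  ))
res
```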
## # A tibble: 10 x 6
## Department JobRole Attrition n pct above_industry_~
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Sales Sales Represen~ Yes 26 0.4 Yes
## 2 Human Resources Human Resources Yes 12 0.308 Yes
## 3 Research & Dev~ Laboratory Tec~ Yes 49 0.219 Yes
## 4 Sales Sales Executive Yes 50 0.183 Yes
## 5 Research & Dev~ Research Scien~ Yes 43 0.166 Yes
## 6 Research & Dev~ Healthcare Rep~ Yes 8 0.0762 No
## 7 Sales Manager Yes 2 0.0645 No
## 8 Research & Dev~ Manufacturing ~ Yes 7 0.0569 No
## 9 Research & Dev~ Manager Yes 2 0.0417 No
## 10 Research & Dev~ Research Direc~ Yes 2 0.0274 No
Note the Dept and Job Role combinations with high percentages and counts. To make this more impactful, add cost to this table.
Create a function that mimics the Excel calculator.
The function includes default values equal to the defaults introduced in Calculating the Cost of Turnover. So if we call the custom function calculate_attrition_cost with the default values, it returns 78,483.33. If we change the default value of n = 1 (representing 1 employee) to n = 200, the function returns 15,696,666.67.
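A minimal sketch of such a function, using the default values from Calculating the Cost of Turnover above. The argument names are assumptions; the course script may differ.

```r
# Hypothetical sketch of calculate_attrition_cost - argument names are assumptions
calculate_attrition_cost <- function(
    n                        = 1,       # number of employees leaving
    salary                   = 80000,
    separation_cost          = 500,
    vacancy_cost             = 10000,
    acquisition_cost         = 4900,
    placement_cost           = 3500,
    net_revenue_per_employee = 250000,
    workdays_per_year        = 240,
    workdays_position_open   = 40,
    workdays_onboarding      = 60,
    onboarding_efficiency    = 0.50) {

  direct_cost       <- separation_cost + vacancy_cost + acquisition_cost + placement_cost
  productivity_cost <- net_revenue_per_employee / workdays_per_year *
    (workdays_position_open + workdays_onboarding * onboarding_efficiency)
  salary_savings    <- salary / workdays_per_year * workdays_position_open

  n * (direct_cost + productivity_cost - salary_savings)
}

calculate_attrition_cost()        # 78483.33
calculate_attrition_cost(n = 200) # 15696666.67
```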
Using code we have already seen above, modified slightly to focus on JobRole rather than Department, we can begin this part of the journey by reviewing JobRole data. Note that we reuse a custom function - calculate_attrition_cost - to return the cost_of_attrition column below.
dept_job_role_tbl %>%
  count(JobRole, Attrition) %>%
  group_by(JobRole) %>%
  mutate(pct = n/sum(n)) %>%
  ungroup() %>%
  filter(Attrition %in% c("Yes")) %>%
  arrange(desc(pct)) %>%
  mutate(above_industry_avg = case_when(pct > 0.088 ~ "Yes", TRUE ~ "No")) %>%
  mutate(cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000))

## # A tibble: 9 x 6
## JobRole Attrition n pct above_industry_a~ cost_of_attriti~
## <chr> <chr> <int> <dbl> <chr> <dbl>
## 1 Sales Represe~ Yes 26 0.4 Yes 2040567.
## 2 Human Resourc~ Yes 12 0.308 Yes 941800.
## 3 Laboratory Te~ Yes 49 0.219 Yes 3845683.
## 4 Sales Executi~ Yes 50 0.183 Yes 3924167.
## 5 Research Scie~ Yes 43 0.166 Yes 3374783.
## 6 Healthcare Re~ Yes 8 0.0762 No 627867.
## 7 Manufacturing~ Yes 7 0.0569 No 549383.
## 8 Manager Yes 4 0.0449 No 313933.
## 9 Research Dire~ Yes 2 0.0274 No 156967.
Note the costs associated with Sales Executive and Laboratory Technician.
This section is dedicated to streamlining the code developed above.
This is code commonly used:
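The code block itself appears to have been lost in extraction. Based on the surrounding output (grouped by JobRole), it was presumably the verbose group_by/summarise pattern, sketched here on a small stand-in tibble since the course data (dept_job_role_tbl) is not defined in this excerpt:

```r
library(dplyr)

# Hypothetical stand-in for dept_job_role_tbl
dept_job_role_tbl <- tibble(
  JobRole   = c("Manager", "Manager", "Manager", "Sales Executive"),
  Attrition = c("No", "Yes", "No", "No")
)

# The verbose pattern: group_by() + summarise(n = n())
counts <- dept_job_role_tbl %>%
  group_by(JobRole, Attrition) %>%
  summarise(n = n())
counts
```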
## # A tibble: 18 x 3
## # Groups: JobRole [?]
## JobRole Attrition n
## <chr> <chr> <int>
## 1 Healthcare Representative No 97
## 2 Healthcare Representative Yes 8
## 3 Human Resources No 27
## 4 Human Resources Yes 12
## 5 Laboratory Technician No 175
## 6 Laboratory Technician Yes 49
## 7 Manager No 85
## 8 Manager Yes 4
## 9 Manufacturing Director No 116
## 10 Manufacturing Director Yes 7
## 11 Research Director No 71
## 12 Research Director Yes 2
## 13 Research Scientist No 216
## 14 Research Scientist Yes 43
## 15 Sales Executive No 223
## 16 Sales Executive Yes 50
## 17 Sales Representative No 39
## 18 Sales Representative Yes 26
The code above is replaced with this simpler code:
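The simpler replacement was presumably dplyr::count(), which collapses the group_by/summarise/n() chain into one call (again sketched on stand-in data, since dept_job_role_tbl is not defined in this excerpt):

```r
library(dplyr)

# Hypothetical stand-in for dept_job_role_tbl
dept_job_role_tbl <- tibble(
  JobRole   = c("Manager", "Manager", "Manager", "Sales Executive"),
  Attrition = c("No", "Yes", "No", "No")
)

counts <- dept_job_role_tbl %>% count(JobRole, Attrition)
counts
```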
## # A tibble: 18 x 3
## JobRole Attrition n
## <chr> <chr> <int>
## 1 Healthcare Representative No 97
## 2 Healthcare Representative Yes 8
## 3 Human Resources No 27
## 4 Human Resources Yes 12
## 5 Laboratory Technician No 175
## 6 Laboratory Technician Yes 49
## 7 Manager No 85
## 8 Manager Yes 4
## 9 Manufacturing Director No 116
## 10 Manufacturing Director Yes 7
## 11 Research Director No 71
## 12 Research Director Yes 2
## 13 Research Scientist No 216
## 14 Research Scientist Yes 43
## 15 Sales Executive No 223
## 16 Sales Executive Yes 50
## 17 Sales Representative No 39
## 18 Sales Representative Yes 26
count is part of dplyr!
Do not forget about the dplyr::tally function - very useful.
tally() is a convenient wrapper for summarize that will either call n() or sum(n) depending on whether you’re tallying for the first time, or re-tallying. count() is similar but calls group_by() before and ungroup() after.
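A quick demonstration of that equivalence on built-in data:

```r
library(dplyr)

df <- as_tibble(mtcars)

via_tally <- df %>% group_by(cyl) %>% tally()  # summarize with n() within groups
via_count <- df %>% count(cyl)                 # count() wraps group_by() + tally()

all.equal(via_tally, via_count)
```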
Pay attention - this is really good code using tidy evaluation!
We have this code so far in streamlining the attrition code:
## # A tibble: 18 x 4
## # Groups: JobRole [9]
## JobRole Attrition n pct
## <chr> <chr> <int> <dbl>
## 1 Healthcare Representative No 97 0.924
## 2 Healthcare Representative Yes 8 0.0762
## 3 Human Resources No 27 0.692
## 4 Human Resources Yes 12 0.308
## 5 Laboratory Technician No 175 0.781
## 6 Laboratory Technician Yes 49 0.219
## 7 Manager No 85 0.955
## 8 Manager Yes 4 0.0449
## 9 Manufacturing Director No 116 0.943
## 10 Manufacturing Director Yes 7 0.0569
## 11 Research Director No 71 0.973
## 12 Research Director Yes 2 0.0274
## 13 Research Scientist No 216 0.834
## 14 Research Scientist Yes 43 0.166
## 15 Sales Executive No 223 0.817
## 16 Sales Executive Yes 50 0.183
## 17 Sales Representative No 39 0.6
## 18 Sales Representative Yes 26 0.4
Calculating percentages from counts is a common programming requirement. Why not create a function that generalizes this?
count_to_pct <- function(data, ..., col = n){
grouping_vars_expr <- quos(...)
col_expr <- enquo(col)
ret <- data %>% group_by(!!! grouping_vars_expr) %>%
mutate(pct = (!! col_expr)/sum(!! col_expr)) %>% ungroup()
return(ret)
}

- `...` enables passing multiple, un-named arguments to a function.
- The `n` column must be enclosed in `quos()` or `enquo()`, for multiple or single columns respectively. This saves the column names as unevaluated expressions that can be called later in the processing. Use `quos()` for multiple columns and `enquo()` for the single column we have called `n`.
- `!!!` (bang-bang-bang) splices multiple grouping variables, evaluating them. (Use `!!` for single columns.)

Demo:
## # A tibble: 21 x 5
## Department JobRole Attrition n pct
## <chr> <chr> <chr> <int> <dbl>
## 1 Human Resources Human Resources No 27 0.692
## 2 Human Resources Human Resources Yes 12 0.308
## 3 Human Resources Manager No 10 1
## 4 Research & Development Healthcare Representative No 97 0.924
## 5 Research & Development Healthcare Representative Yes 8 0.0762
## 6 Research & Development Laboratory Technician No 175 0.781
## 7 Research & Development Laboratory Technician Yes 49 0.219
## 8 Research & Development Manager No 46 0.958
## 9 Research & Development Manager Yes 2 0.0417
## 10 Research & Development Manufacturing Director No 116 0.943
## # ... with 11 more rows
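As a quick check of count_to_pct on built-in data (the function as defined above): within each cyl group, the pct values should sum to 1.

```r
library(dplyr)

count_to_pct <- function(data, ..., col = n) {  # as defined above
  grouping_vars_expr <- quos(...)
  col_expr <- enquo(col)
  data %>%
    group_by(!!! grouping_vars_expr) %>%
    mutate(pct = (!! col_expr) / sum(!! col_expr)) %>%
    ungroup()
}

res <- mtcars %>% count(cyl, am) %>% count_to_pct(cyl)
res
```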
Continuing the development of a streamlined function, review where we are so far:
dept_job_role_tbl %>%
  count(Department, JobRole, Attrition) %>%
  count_to_pct(Department, JobRole) %>%
  filter(Attrition %in% c("Yes")) %>%
  arrange(desc(pct)) %>%
  mutate(above_industry_avg = case_when(pct > 0.088 ~ "Yes", TRUE ~ "No"))

## # A tibble: 10 x 6
## Department JobRole Attrition n pct above_industry_~
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Sales Sales Represen~ Yes 26 0.4 Yes
## 2 Human Resources Human Resources Yes 12 0.308 Yes
## 3 Research & Dev~ Laboratory Tec~ Yes 49 0.219 Yes
## 4 Sales Sales Executive Yes 50 0.183 Yes
## 5 Research & Dev~ Research Scien~ Yes 43 0.166 Yes
## 6 Research & Dev~ Healthcare Rep~ Yes 8 0.0762 No
## 7 Sales Manager Yes 2 0.0645 No
## 8 Research & Dev~ Manufacturing ~ Yes 7 0.0569 No
## 9 Research & Dev~ Manager Yes 2 0.0417 No
## 10 Research & Dev~ Research Direc~ Yes 2 0.0274 No
Generalize filter - arrange - mutate statements:
assess_attrition <- function(data, attrition_col, attrition_value, baseline_pct){
  attrition_col_expr <- enquo(attrition_col)
  data %>%
    filter((!! attrition_col_expr) %in% attrition_value) %>%
    arrange(desc(pct)) %>%
    mutate(above_industry_avg = case_when(pct > baseline_pct ~ "Yes", TRUE ~ "No"))
}

- `attrition_col`: the attrition column - it may not always be called Attrition.
- `attrition_value`: in our case the values are Yes and No.
- `baseline_pct` is hopefully self-explanatory!
- `enquo()` saves a single column name as an expression that can be evaluated later. This way the name of the attrition column can be whatever the user chooses.
- `pct` does not change because it is created consistently in the `count_to_pct` function. (Of course this could be changed as an enhancement!)
- The hard-coded 0.088 is generalized to `baseline_pct`.
- Use parentheses to give Tidy Eval evaluations priority!

Function inputs in text format (attrition_value = "Yes") do not require Tidy Eval.
Demo the new code:
dept_job_role_tbl %>%
count(Department, JobRole, Attrition) %>% count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>% #New code line
mutate(cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000))

## # A tibble: 10 x 7
## Department JobRole Attrition n pct above_industry_~
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Sales Sales ~ Yes 26 0.4 Yes
## 2 Human Res~ Human ~ Yes 12 0.308 Yes
## 3 Research ~ Labora~ Yes 49 0.219 Yes
## 4 Sales Sales ~ Yes 50 0.183 Yes
## 5 Research ~ Resear~ Yes 43 0.166 Yes
## 6 Research ~ Health~ Yes 8 0.0762 No
## 7 Sales Manager Yes 2 0.0645 No
## 8 Research ~ Manufa~ Yes 7 0.0569 No
## 9 Research ~ Manager Yes 2 0.0417 No
## 10 Research ~ Resear~ Yes 2 0.0274 No
## # ... with 1 more variable: cost_of_attrition <dbl>
Create an impactful visualization like this:
Here is the code so far, plus some data manipulation so we can produce the plot.
dept_job_role_tbl %>%
count(Department, JobRole, Attrition) %>% count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>%
mutate(cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000)) %>%
# Data Manipulation
mutate(name = str_c(Department, JobRole, sep = ": ") %>% as_factor()) %>% #LINE 1
mutate(name = fct_reorder(name, cost_of_attrition)) %>% # LINE 2
mutate(cost_text = str_c("$", format(cost_of_attrition / 1e6, digits = 2), "M", sep = "")) # LINE 3

## # A tibble: 10 x 9
## Department JobRole Attrition n pct above_industry_~
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Sales Sales ~ Yes 26 0.4 Yes
## 2 Human Res~ Human ~ Yes 12 0.308 Yes
## 3 Research ~ Labora~ Yes 49 0.219 Yes
## 4 Sales Sales ~ Yes 50 0.183 Yes
## 5 Research ~ Resear~ Yes 43 0.166 Yes
## 6 Research ~ Health~ Yes 8 0.0762 No
## 7 Sales Manager Yes 2 0.0645 No
## 8 Research ~ Manufa~ Yes 7 0.0569 No
## 9 Research ~ Manager Yes 2 0.0417 No
## 10 Research ~ Resear~ Yes 2 0.0274 No
## # ... with 3 more variables: cost_of_attrition <dbl>, name <fct>,
## # cost_text <chr>
- `str_c` from stringr concatenates Department and JobRole with a colon in between them.
- Factors are required by ggplot to order categorical variables (text will not work). `as_factor` is part of forcats; it creates levels in the order in which they appear, which will be the same on every platform. (Base R sorts in the current locale, which can vary from place to place.)
- `fct_reorder`, also from forcats, reorders a factor's levels by the magnitude of a different numeric value - in this case, cost_of_attrition.
- `str_c` concatenation again - pretty simple, but effective!

OK, the data manipulation to plot the cost of attrition is complete. Now review the ggplot code to produce the plot shown at the beginning of this section. It is easy - provided you do one step at a time!
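A tiny illustration of fct_reorder on toy data (the cost values stand in for cost_of_attrition):

```r
library(forcats)

f <- factor(c("a", "b", "c"))
cost <- c(3, 1, 2)            # toy values standing in for cost_of_attrition

levels(fct_reorder(f, cost))  # "b" "c" "a" - levels ordered by ascending cost
```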
dept_job_role_tbl %>%
count(Department, JobRole, Attrition) %>% count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>%
mutate(cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000)) %>%
# Data Manipulation
mutate(name = str_c(Department, JobRole, sep = ": ") %>% as_factor()) %>% #must be factors to order
mutate(name = fct_reorder(name, cost_of_attrition)) %>% # orders the names by cost
mutate(cost_text = str_c("$", format(cost_of_attrition / 1e6, digits = 2), "M", sep = "")) %>%
# Plotting
ggplot(aes(x = cost_of_attrition, y = name)) +
geom_segment(aes(xend = 0, yend = name), color = palette_light()[[1]]) + # LINE 1
geom_point(aes(size = cost_of_attrition), color = palette_light()[[1]]) + # LINE 2
scale_x_continuous(label = scales::dollar) + # LINE 3
geom_label(aes(label = cost_text, size = cost_of_attrition), hjust = "inward", color = palette_light()[[1]]) + # LINE 4
theme_tq() + scale_size(range = c(4, 5)) + # LINE 5
labs(title = "Estimated Cost of Attrition: By Dept and Job Role", y = "", x = "Cost of Attrition") +
theme(legend.position = "none")

- `geom_segment` adds lines to the plot canvas. The first ggplot statement locates the end of the line at cost_of_attrition. `xend` defines the start of the line segment at x = 0; `yend` makes sure the line segment aligns with the correct y-axis label. Simple!
- `palette_light()[[1]]` selects the color #2c3e50 from the tidyquant package. The full palette_light colors are: #2c3e50, #e31a1c, #18BC9C, #CCBE93, #a6cee3, #1f78b4, #b2df8a, #fb9a99, #fdbf6f, #ff7f00, #cab2d6, #6a3d9a.
- `size = cost_of_attrition` adjusts the size of the segment endpoint to the attrition cost magnitude (this could have been included in the first aes statement).
- The scales package gets the axis values into dollar format - remember the `scale_x_continuous` ggplot function!
- `geom_label` simply adds labels to points on the plot using `cost_text`, which was created earlier in the plot data manipulation.
- `theme_tq` from tidyquant is a general theme that adjusts background, faceting palette, legend position and more. It works well for conservative business publications.
- `scale_size` adjusts the max and min size of elements to prevent large/small values from becoming too large/small. The label sizes are now limited, making the label text for lower values readable. Cool!

The goal is to replace the code below - copied from the last plotting exercise:
dept_job_role_tbl %>%
count(Department, JobRole, Attrition) %>% count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>%
mutate(cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000)) %>%
###<b>
#Data Manipulation
mutate(name = str_c(Department, JobRole, sep = ": ") %>% as_factor()) %>%
mutate(name = fct_reorder(name, cost_of_attrition)) %>% # orders the names by cost
mutate(cost_text = str_c("$", format(cost_of_attrition / 1e6, digits = 2), "M", sep = "")) %>%
#Plotting
ggplot(aes(x = cost_of_attrition, y = name)) +
geom_segment(aes(xend = 0, yend = name), color = palette_light()[[1]]) +
geom_point(aes(size = cost_of_attrition), color = palette_light()[[1]]) +
scale_x_continuous(label = scales::dollar) +
geom_label(aes(label = cost_text, size = cost_of_attrition),
hjust = "inward", color = palette_light()[[1]]) +
theme_tq() + scale_size(range = c(4, 5)) +
labs(title = "Estimated Cost of Attrition: By Dept and Job Role",
y = "", x = "Cost of Attrition") + theme(legend.position = "none")

Do not be intimidated - it is not as hard as it might appear initially!
First, an explanation of quosures and tidy evaluation:
There are four functions:
- quo() and enquo(): singular, captures one variable, stored as an unevaluated expression
- quos() and enquos(): plural, captures multiple variables, stored as a list of unevaluated expressions
Explanation: enquo() is singular and used inside a function. It is comparable to quo(), which is used outside of functions. Both do roughly the same thing: they quote, meaning they capture the unevaluated variable as an expression before it gets evaluated. The "quoted" variable is stored for evaluation later. The main difference between quo() and enquo() is that quo() should be used outside functions and enquo() inside. The reason why is not important here and is really complex.
The rlang library now has an enquos() function that is plural (rlang did not always have enquos(), which is why quos() was used). Both capture multiple variables as expressions contained inside a list. This is useful when you don't know how many arguments the user will supply - for example, several columns. The difference between quos() and enquos() is, again, that quos() is meant to be used external to functions whereas enquos() is meant to be used internal. However, this is not what the tutorial shows, because of an inconsistency with early versions of rlang that did not include the enquos() function.
rlang::sym turns a single character string into an expression. The expression is typically captured in enquo() or quos() to delay evaluation. The sym() and syms() functions are only necessary when converting a string to a symbol. Generally the user will supply a symbol, such as a column name; they don't typically supply a string. Therefore, we only need sym()/syms() in this special case where we have a string that represents one or more column names.
In summary:
- Use quo() / enquo() when capturing one variable.
- Use quos() / enquos() when capturing multiple variables.
- Use quo() / quos() outside of functions.
- Use enquo() / enquos() inside of functions.
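A minimal, self-contained sketch of these ideas on built-in data (avg_of and the column names are illustrative only, not from the course):

```r
library(dplyr)
library(rlang)

# enquo(): capture ONE column supplied by the caller, inside a function
avg_of <- function(data, col) {
  col_expr <- enquo(col)                       # quote the unevaluated column name
  data %>% summarise(avg = mean(!! col_expr))  # !! (bang-bang) evaluates it here
}
res <- avg_of(mtcars, mpg)
res

# rlang::sym(): when the column name arrives as a STRING, convert it to a symbol first
col_chr <- "cyl"
counts <- mtcars %>% count(!! sym(col_chr))
counts
```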
plot_attrition <- function(data, ..., .value,
                           fct_reorder = TRUE,   # order the factor by the .value argument
                           fct_rev = FALSE,      # ascending vs descending y labels
                           include_lbl = TRUE,
                           color = palette_light()[[1]],
                           units = c("0", "K", "M")) {
  # ... captures the grouping columns (e.g. Department, JobRole); .value is the measure (e.g. cost_of_attrition)
  # Sometimes the units will not be in millions; the default "0" means no units
  # Assumes dplyr, stringr, forcats, rlang, scales, ggplot2 and tidyquant (palette_light, theme_tq) are loaded

  # Inputs
  group_vars_expr <- quos(...)  # perhaps Department and JobRole
  if (length(group_vars_expr) == 0) {
    # Then just use the first column name - these drive the y labels in the plot, so there has to be something!
    # !! inlines the symbol so that it is later evaluated as a column
    group_vars_expr <- quos(!! rlang::sym(colnames(data)[[1]]))
  }

  value_expr <- enquo(.value)         # used in data manipulation and plotting (e.g. cost_of_attrition)
  value_name <- quo_name(value_expr)  # for ggplot - converts the quosure to a character string (opposite of rlang::sym()); ggplot is not ready for tidy eval

  units_val <- switch(units[[1]], "M" = 1e6, "K" = 1e3, "0" = 1)  # works like nested if statements
  if (units[[1]] == "0") units <- ""  # make it empty, because it gets appended to labels in the plot

  # Data Manipulation
  # Function factory: scales::dollar_format() produces a function; usd is that function - try usd(1000)
  usd <- scales::dollar_format(prefix = "$", largest_with_cents = 1e3)

  data_manipulated <- data %>%
    mutate(name = str_c(!!! group_vars_expr, sep = ": ") %>% as_factor()) %>%
    # Above, no parentheses around !!! group_vars_expr - otherwise str_c sees it as a single input (a pseudo bug in str_c)
    mutate(value_text = str_c(usd(!! value_expr / units_val), units[[1]], sep = ""))

  if (fct_reorder) {
    data_manipulated <- data_manipulated %>%
      mutate(name = fct_reorder(name, (!! value_expr))) %>%
      arrange(name)  # arrange() is used because fct_reorder() does not actually sort the data; arrange() does the sorting
  }

  if (fct_rev) {
    data_manipulated <- data_manipulated %>%
      mutate(name = fct_rev(name)) %>%
      arrange(name)
  }

  # Visualization
  # NOTE: ggplot does not work with the tidyeval framework - requires the use of aes_string
  g <- data_manipulated %>%
    ggplot(aes_string(x = value_name, y = "name")) +
    geom_segment(aes(xend = 0, yend = name), color = color) +
    geom_point(aes_string(size = value_name), color = color) +
    scale_x_continuous(label = scales::dollar) +
    theme_tq() +
    scale_size(range = c(4, 5)) +  # range = c(3, 5) should work, but I am not satisfied with the output
    theme(legend.position = "none")

  if (include_lbl) {
    g <- g + geom_label(aes_string(label = "value_text", size = value_name),
                        hjust = "inward", color = color)
  }

  return(g)
}

A potential problem: an error can occur when using this function on a grouped data set.
The warning thrown looks like this:

Warning in mutate_impl(.data, dots): binding character and factor vector, coercing into character vector

The solution is to add ungroup() before the factoring within the Data Manipulation step.
Using estimated average salary and average productivity (revenue) by Job Role and Department, create a more realistic attrition cost estimate.
See homework_1_assignment.R for some relatively complex dplyr data code.
Here is a peek at the new Excel file that provides the source information.
dept_jobrole_tbl <- train_raw_tbl %>%
select(EmployeeNumber, Department, JobRole, PerformanceRating, Attrition)
kpi_industry_turnover_pct <- 0.088
# Productivity Cost by Role ----
productivity_cost_by_role_tbl <- read_excel("Challenges/productivity_cost_by_role.xlsx")
productivity_cost_by_role_tbl

## # A tibble: 11 x 4
## Department JobRole Salary_Average Revenue_Average
## <chr> <chr> <dbl> <dbl>
## 1 Human Resources Human Resources 65000 104000
## 2 Human Resources Manager 85000 153000
## 3 Research & Develop~ Healthcare Represen~ 45000 76500
## 4 Research & Develop~ Laboratory Technici~ 65000 110500
## 5 Research & Develop~ Manager 88000 158400
## 6 Research & Develop~ Manufacturing Direc~ 100000 220000.
## 7 Research & Develop~ Research Director 125000 312500
## 8 Research & Develop~ Research Scientist 95000 171000
## 9 Sales Manager 95000 190000
## 10 Sales Sales Executive 140000 308000
## 11 Sales Sales Representative 80000 160000
dept_jobrole_productivty_tbl <- dept_jobrole_tbl %>%
count(Department, JobRole, Attrition) %>%
count_to_pct(Department, JobRole) %>%
left_join(productivity_cost_by_role_tbl, by = c("Department", "JobRole")) %>%
assess_attrition(Attrition,attrition_value = 'Yes',baseline_pct = 0.088) %>%
mutate(attrition_cost = calculate_attrition_cost(n = n, salary = Salary_Average,
net_revenue_per_employee = Revenue_Average))
dept_jobrole_productivty_tbl %>% plot_attrition(Department, JobRole, .value = attrition_cost)

dept_productivty_tbl <- dept_jobrole_tbl %>%
count(Department, JobRole, Attrition) %>%
count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, "Yes", kpi_industry_turnover_pct) %>%
left_join(productivity_cost_by_role_tbl, by = c("Department", "JobRole")) %>%
group_by(Department) %>%
summarise(n = sum(n),
Salary_Average = sum(Salary_Average), Revenue_Average = sum(Revenue_Average)) %>%
mutate(attrition_cost =
calculate_attrition_cost(n = n, salary = Salary_Average,
net_revenue_per_employee = Revenue_Average))
dept_productivty_tbl %>% plot_attrition(Department, .value = attrition_cost)

This is the start of the 2nd phase of CRISP-DM - Data Understanding.
This section introduces skimr - a tool for feature exploration by data type. GGally is also introduced to visualize feature interaction relationships using ggpairs.
May want to review the feature grouping introduced earlier. Group Features
skimr provides a good overview of the data, automatically separating character and numeric features. For character data, it is easy to spot features with many unique values (more than perhaps 9), which might suggest binning. For numeric data, the inline histograms make it easy to see which variables might really be categorical.
## Skim summary statistics
## n obs: 1250
## n variables: 35
##
## -- Variable type:character ---------------------------------------------------------------------------------------------
## variable missing complete n min max empty n_unique
## Attrition 0 1250 1250 2 3 0 2
## BusinessTravel 0 1250 1250 10 17 0 3
## Department 0 1250 1250 5 22 0 3
## EducationField 0 1250 1250 5 16 0 6
## Gender 0 1250 1250 4 6 0 2
## JobRole 0 1250 1250 7 25 0 9
## MaritalStatus 0 1250 1250 6 8 0 3
## Over18 0 1250 1250 1 1 0 1
## OverTime 0 1250 1250 2 3 0 2
##
## -- Variable type:numeric -----------------------------------------------------------------------------------------------
## variable missing complete n mean sd p0
## Age 0 1250 1250 36.95 9.08 18
## DailyRate 0 1250 1250 799.74 403.52 102
## DistanceFromHome 0 1250 1250 9.1 8.1 1
## Education 0 1250 1250 2.91 1.03 1
## EmployeeCount 0 1250 1250 1 0 1
## EmployeeNumber 0 1250 1250 1022.71 600.42 1
## EnvironmentSatisfaction 0 1250 1250 2.71 1.09 1
## HourlyRate 0 1250 1250 65.82 20.14 30
## JobInvolvement 0 1250 1250 2.75 0.71 1
## JobLevel 0 1250 1250 2.08 1.12 1
## JobSatisfaction 0 1250 1250 2.73 1.11 1
## MonthlyIncome 0 1250 1250 6569.4 4777.89 1009
## MonthlyRate 0 1250 1250 14294.5 7071.61 2097
## NumCompaniesWorked 0 1250 1250 2.71 2.49 0
## PercentSalaryHike 0 1250 1250 15.23 3.69 11
## PerformanceRating 0 1250 1250 3.16 0.36 3
## RelationshipSatisfaction 0 1250 1250 2.71 1.09 1
## StandardHours 0 1250 1250 80 0 80
## StockOptionLevel 0 1250 1250 0.8 0.86 0
## TotalWorkingYears 0 1250 1250 11.4 7.79 0
## TrainingTimesLastYear 0 1250 1250 2.81 1.29 0
## WorkLifeBalance 0 1250 1250 2.76 0.71 1
## YearsAtCompany 0 1250 1250 7.09 6.21 0
## YearsInCurrentRole 0 1250 1250 4.26 3.66 0
## YearsSinceLastPromotion 0 1250 1250 2.21 3.2 0
## YearsWithCurrManager 0 1250 1250 4.16 3.59 0
## p25 p50 p75 p100 hist
## 30 36 43 60 <U+2582><U+2583><U+2586><U+2587><U+2585><U+2583><U+2582><U+2582>
## 464 798 1156 1498 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
## 2 7 13.75 29 <U+2587><U+2585><U+2583><U+2581><U+2581><U+2581><U+2581><U+2581>
## 2 3 4 5 <U+2582><U+2585><U+2581><U+2587><U+2581><U+2586><U+2581><U+2581>
## 1 1 1 1 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 486.25 1020.5 1553.75 2068 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
## 2 3 4 4 <U+2585><U+2581><U+2585><U+2581><U+2581><U+2587><U+2581><U+2587>
## 48 66 83 100 <U+2586><U+2587><U+2587><U+2587><U+2586><U+2587><U+2587><U+2587>
## 2 3 3 4 <U+2581><U+2581><U+2583><U+2581><U+2581><U+2587><U+2581><U+2582>
## 1 2 3 5 <U+2587><U+2587><U+2581><U+2583><U+2581><U+2582><U+2581><U+2581>
## 2 3 4 4 <U+2585><U+2581><U+2585><U+2581><U+2581><U+2587><U+2581><U+2587>
## 2935.25 4903.5 8395 19999 <U+2587><U+2587><U+2583><U+2582><U+2581><U+2581><U+2581><U+2582>
## 8191.25 14328 20337.25 26999 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2586>
## 1 2 4 9 <U+2587><U+2582><U+2582><U+2582><U+2581><U+2581><U+2581><U+2581>
## 12 14 18 25 <U+2587><U+2587><U+2583><U+2583><U+2582><U+2582><U+2582><U+2581>
## 3 3 3 4 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2582>
## 2 3 4 4 <U+2585><U+2581><U+2586><U+2581><U+2581><U+2587><U+2581><U+2587>
## 80 80 80 80 <U+2581><U+2581><U+2581><U+2587><U+2581><U+2581><U+2581><U+2581>
## 0 1 1 3 <U+2587><U+2581><U+2587><U+2581><U+2581><U+2582><U+2581><U+2581>
## 6 10 15 40 <U+2583><U+2587><U+2582><U+2582><U+2582><U+2581><U+2581><U+2581>
## 2 3 3 6 <U+2581><U+2581><U+2587><U+2587><U+2581><U+2582><U+2582><U+2581>
## 2 3 3 4 <U+2581><U+2581><U+2583><U+2581><U+2581><U+2587><U+2581><U+2582>
## 3 5 10 40 <U+2587><U+2585><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 2 3 7 18 <U+2587><U+2583><U+2581><U+2585><U+2581><U+2581><U+2581><U+2581>
## 0 1 3 15 <U+2587><U+2582><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
## 2 3 7 17 <U+2587><U+2583><U+2581><U+2583><U+2581><U+2581><U+2581><U+2581>
## Observations: 1,250
## Variables: 9
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_R...
## $ Department <chr> "Sales", "Research & Development", "Research & ...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "Lif...
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "Ma...
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Labor...
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Marr...
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "...
Character data does not have levels, so use unique instead. map from purrr applies unique across each feature.
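The listing below was presumably produced with something like train_raw_tbl %>% select_if(is.character) %>% map(unique). A self-contained sketch of the same pattern on made-up data:

```r
library(purrr)

# Toy data standing in for train_raw_tbl's character columns
df <- data.frame(
  Attrition = c("Yes", "No", "Yes"),
  Gender    = c("Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# map() applies unique() to each column, returning a named list
unique_values <- map(df, unique)
unique_values
```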
## $Attrition
## [1] "Yes" "No"
##
## $BusinessTravel
## [1] "Travel_Rarely" "Travel_Frequently" "Non-Travel"
##
## $Department
## [1] "Sales" "Research & Development"
## [3] "Human Resources"
##
## $EducationField
## [1] "Life Sciences" "Other" "Medical"
## [4] "Marketing" "Technical Degree" "Human Resources"
##
## $Gender
## [1] "Female" "Male"
##
## $JobRole
## [1] "Sales Executive" "Research Scientist"
## [3] "Laboratory Technician" "Manufacturing Director"
## [5] "Healthcare Representative" "Manager"
## [7] "Sales Representative" "Research Director"
## [9] "Human Resources"
This improves upon unique by including the counts for each character feature.
## $Attrition
##
## No Yes
## 1049 201
##
## $BusinessTravel
##
## Non-Travel Travel_Frequently Travel_Rarely
## 124 239 887
##
## $Department
##
## Human Resources Research & Development Sales
## 49 832 369
##
## $EducationField
##
## Human Resources Life Sciences Marketing Medical
## 23 520 135 392
## Other Technical Degree
## 66 114
##
## $Gender
##
## Female Male
## 491 759
##
## $JobRole
##
## Healthcare Representative Human Resources
## 105 39
## Laboratory Technician Manager
## 224 89
## Manufacturing Director Research Director
## 123 73
## Research Scientist Sales Executive
## 259 273
## Sales Representative
## 65
Counts are nice but proportions are better! Below, an anonymous function is used. {#anonymous}
The ~ indicates the start of a formula (an anonymous function).
Also understand that train_raw_tbl %>% select_if(is.character) %>% map(~ table(.)) is the same as train_raw_tbl %>% select_if(is.character) %>% map(table).
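That equivalence can be checked directly; a minimal sketch on made-up data:

```r
library(purrr)

df <- data.frame(
  Attrition = c("Yes", "No", "Yes", "No"),
  OverTime  = c("No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

t_lambda <- map(df, ~ table(.x))  # anonymous-function (formula) form
t_direct <- map(df, table)        # direct-function form

# Same counts either way (only the internal dimname label can differ)
t_lambda$Attrition
```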
# Anonymous function with ~
train_raw_tbl %>% select_if(is.character) %>% map(~ table(.) %>% prop.table())
## $Attrition
## .
## No Yes
## 0.84 0.16
##
## $BusinessTravel
## .
## Non-Travel Travel_Frequently Travel_Rarely
## 0.099 0.191 0.710
##
## $Department
## .
## Human Resources Research & Development Sales
## 0.039 0.666 0.295
##
## $EducationField
## .
## Human Resources Life Sciences Marketing Medical
## 0.018 0.416 0.108 0.314
## Other Technical Degree
## 0.053 0.091
##
## $Gender
## .
## Female Male
## 0.39 0.61
##
## $JobRole
## .
## Healthcare Representative Human Resources
## 0.084 0.031
## Laboratory Technician Manager
## 0.179 0.071
## Manufacturing Director Research Director
## 0.098 0.058
## Research Scientist Sales Executive
## 0.207 0.218
## Sales Representative
## 0.052
##
## $MaritalStatus
## .
## Divorced Married Single
## 0.22 0.46 0.32
##
## $Over18
## .
## Y
## 1
##
## $OverTime
## .
## No Yes
## 0.72 0.28
map_df provides a much cleaner output! Outputs a dataframe rather than a list.
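The single-row tibble below appears to count distinct values per numeric column, presumably via something like train_raw_tbl %>% select_if(is.numeric) %>% map_df(~ unique(.) %>% length()). A self-contained sketch of the pattern:

```r
library(purrr)

df <- data.frame(
  a = c(1, 1, 2, 2),
  b = c(5, 6, 7, 8)
)

# map_df() binds the per-column results into a one-row tibble
n_unique <- map_df(df, ~ length(unique(.x)))
n_unique
```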
## # A tibble: 1 x 26
## Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber
## <int> <int> <int> <int> <int> <int>
## 1 43 808 29 5 1 1250
## # ... with 20 more variables: EnvironmentSatisfaction <int>,
## # HourlyRate <int>, JobInvolvement <int>, JobLevel <int>,
## # JobSatisfaction <int>, MonthlyIncome <int>, MonthlyRate <int>,
## # NumCompaniesWorked <int>, PercentSalaryHike <int>,
## # PerformanceRating <int>, RelationshipSatisfaction <int>,
## # StandardHours <int>, StockOptionLevel <int>, TotalWorkingYears <int>,
## # TrainingTimesLastYear <int>, WorkLifeBalance <int>,
## # YearsAtCompany <int>, YearsInCurrentRole <int>,
## # YearsSinceLastPromotion <int>, YearsWithCurrManager <int>
{#gather_example}
train_raw_tbl %>% select_if(is.numeric) %>% map_df(~ unique(.) %>% length()) %>%
gather() %>% arrange(value) %>% filter(value <= 10) # features with 10 or fewer unique values are likely categorical
## # A tibble: 13 x 2
## key value
## <chr> <int>
## 1 EmployeeCount 1
## 2 StandardHours 1
## 3 PerformanceRating 2
## 4 EnvironmentSatisfaction 4
## 5 JobInvolvement 4
## 6 JobSatisfaction 4
## 7 RelationshipSatisfaction 4
## 8 StockOptionLevel 4
## 9 WorkLifeBalance 4
## 10 Education 5
## 11 JobLevel 5
## 12 TrainingTimesLastYear 7
## 13 NumCompaniesWorked 10
Numeric features with fewer unique values are likely to be discrete, and numeric features with more are likely continuous. (Note above that features where value = 1 may indicate non-essential, zero-variance variables.) Above are features that might be converted to factors. The table below lists the features that should remain continuous.
train_raw_tbl %>% select_if(is.numeric) %>% map_df(~ unique(.) %>% length()) %>%
gather() %>% arrange(value) %>% filter(value > 10) # features with more than 10 unique values are likely continuous
## # A tibble: 13 x 2
## key value
## <chr> <int>
## 1 PercentSalaryHike 15
## 2 YearsSinceLastPromotion 16
## 3 YearsWithCurrManager 18
## 4 YearsInCurrentRole 19
## 5 DistanceFromHome 29
## 6 YearsAtCompany 37
## 7 TotalWorkingYears 40
## 8 Age 43
## 9 HourlyRate 71
## 10 DailyRate 808
## 11 MonthlyIncome 1161
## 12 MonthlyRate 1217
## 13 EmployeeNumber 1250
To begin the visualization of the data, select the descriptive variables (serves as an example). There is a mix of character and continuous features.
## # A tibble: 1,250 x 7
## Attrition Age Gender MaritalStatus NumCompaniesWor~ Over18
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 Yes 41 Female Single 8 Y
## 2 No 49 Male Married 1 Y
## 3 Yes 37 Male Single 6 Y
## 4 No 33 Female Married 1 Y
## 5 No 27 Male Married 9 Y
## 6 No 32 Male Single 0 Y
## 7 No 59 Female Married 4 Y
## 8 No 30 Male Divorced 1 Y
## 9 No 38 Male Single 0 Y
## 10 No 36 Male Married 6 Y
## # ... with 1,240 more rows, and 1 more variable: DistanceFromHome <dbl>
Apply the ggpairs function:
Customize the output to make it a bit easier to interpret.
train_raw_tbl %>% select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
ggpairs(aes(color = Attrition), lower = "blank", legend = 1, diag = list(continuous = wrap("densityDiag", alpha = 0.5))) +
theme(legend.position = "bottom")
Take the visualization above and create a function.
color = NULL will output plain grey. However, it is best to select a feature to highlight (typically the label).
rlang::quo_is_null simply tests whether color_expr is NULL (test by typing color_expr in the console). Because ggpairs expects a string, quo_name and aes_string must be used.
plot_ggpairs <- function(data, color = NULL, density_alpha = 0.5){
    color_expr <- enquo(color)
    if(rlang::quo_is_null(color_expr)){
        g <- data %>% ggpairs(lower = "blank")
    } else {
        color_name <- quo_name(color_expr)
        g <- data %>%
            ggpairs(mapping = aes_string(color = color_name),
                    lower = "blank", legend = 1,
                    diag = list(continuous = wrap("densityDiag", alpha = density_alpha))) +
            theme(legend.position = "bottom")
    }
    return(g)
}
Evaluate the data features by category and understand the importance of the features using the new plot_ggpairs function. How great is that?
Observations
contains() is NOT case sensitive. select also provides an ends_with() option.
Observations
train_raw_tbl %>% select(Attrition, contains("income"), contains("rate"), contains("salary"),
contains("stock")) %>% plot_ggpairs(Attrition)
Observations
There are two goals for this section:
Use recipes to build machine learning preprocessing templates. The telco-train.xlsx and telco_data_definitions.xlsx files must be merged to present the data in a meaningful way; that is the purpose of the processing pipeline.
Look at an example:
The plot above is not very readable.
The human readable pipeline will correct this and other pieces to make the plots more user-friendly.
Explore the data definitions file - note that many of the features are coded - numbers used as values for descriptions for categorical data.
path_data_definitions <- "00_Data/telco_data_definitions.xlsx"
definitions_raw_tbl <- read_excel(path_data_definitions, sheet = 1, col_names = FALSE)
definitions_raw_tbl## # A tibble: 35 x 2
## X__1 X__2
## <chr> <chr>
## 1 Education 1 'Below College'
## 2 <NA> 2 'College'
## 3 <NA> 3 'Bachelor'
## 4 <NA> 4 'Master'
## 5 <NA> 5 'Doctor'
## 6 <NA> <NA>
## 7 EnvironmentSatisfaction 1 'Low'
## 8 <NA> 2 'Medium'
## 9 <NA> 3 'High'
## 10 <NA> 4 'Very High'
## # ... with 25 more rows
X__1 = Feature Name; X__2 = Feature Code + Feature Description
Tidying this data will allow us to replace numbers with something more meaningful using tidyverse functions.
fill replaces missing values (NAs) with the closest entry (previous if .direction = "down", next if .direction = "up"). Here it fills the NAs under X__1 with the feature name above them (e.g. Education) - very cool!
separate (from tidyr) splits X__2 into 2 columns: one with the number (key), the other with the description (value); remove = TRUE simply drops the original X__2 column.
str_replace replaces a matched pattern in a string with a replacement argument; here it strips the trailing quote from the value column.
definitions_tbl <- definitions_raw_tbl %>% fill(X__1, .direction = "down") %>% filter(!is.na(X__2)) %>%
separate(X__2, into = c("key", "value"), sep = " '", remove = TRUE) %>%
rename(column_name = X__1) %>% mutate(key = as.numeric(key)) %>%
mutate(value = value %>% str_replace(pattern = "'", replacement = ""))
definitions_tbl
## # A tibble: 29 x 3
## column_name key value
## <chr> <dbl> <chr>
## 1 Education 1 Below College
## 2 Education 2 College
## 3 Education 3 Bachelor
## 4 Education 4 Master
## 5 Education 5 Doctor
## 6 EnvironmentSatisfaction 1 Low
## 7 EnvironmentSatisfaction 2 Medium
## 8 EnvironmentSatisfaction 3 High
## 9 EnvironmentSatisfaction 4 Very High
## 10 JobInvolvement 1 Low
## # ... with 19 more rows
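The fill/filter/separate steps above can be seen in isolation; a minimal sketch of the same two-column pattern (toy values, not the real definitions file):

```r
library(dplyr)
library(tidyr)

raw <- tibble::tibble(
  X__1 = c("Education", NA, NA),
  X__2 = c("1 'Below College'", "2 'College'", NA)
)

tidied <- raw %>%
  fill(X__1, .direction = "down") %>%            # carry the feature name downward
  filter(!is.na(X__2)) %>%                       # drop the blank separator rows
  separate(X__2, into = c("key", "value"), sep = " '") %>%
  mutate(key = as.numeric(key),
         value = stringr::str_replace(value, "'", ""))

tidied
```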
definitions_tbl has multiple datasets in one dataframe. They will need to be integrated separately into the training and testing datasets. This will make more sense in a bit. This process starts by making definitions_tbl into a list.
This concept is new - read and study carefully!
split splits a dataframe into multiple dataframes contained within a list. Supply a column name as a vector (meaning .$column_name). Therefore, Education, EnvironmentSatisfaction and JobInvolvement will be separated into individual dataframes. Example: definitions_list[[3]]
# A tibble: 4 x 3
column_name key value
<chr> <dbl> <chr>
1 JobInvolvement 1 Low
2 JobInvolvement 2 Medium
3 JobInvolvement 3 High
4 JobInvolvement 4 Very High
Use map and an anonymous function (the ~) to remove column_name from each of the dataframes stored in the newly created list:
> definitions_list[[3]]
# A tibble: 4 x 2
key value
<dbl> <chr>
1 1 Low
2 2 Medium
3 3 High
4 4 Very High
Then convert value into a factor:
definitions_list <- definitions_tbl %>%
split(.$column_name) %>% #split df into multiple dfs within a list
map(~ select(., -column_name)) %>% # remove column_name col leaving key and value
map(~ mutate(., value = as_factor(value))) # Creates factors in the order they appear - not alphabetically like as.factor
definitions_list[[3]]
## # A tibble: 4 x 2
## key value
## <dbl> <fct>
## 1 1 Low
## 2 2 Medium
## 3 3 High
## 4 4 Very High
IMPORTANT: forcats::as_factor is different from base::as.factor. The former creates factors in the order in which the values appear, as opposed to alphabetically with base::as.factor.
Next, rename key and value in each list element to something like Education and Education_value.
The loop stores each element's name in list_name, renames the key column to list_name, and makes the value column list_name pasted with "_value" - really not that hard!
for(i in seq_along(definitions_list)){
    list_name <- names(definitions_list)[i]
    colnames(definitions_list[[i]]) <- c(list_name, paste0(list_name, "_value"))
}
definitions_list[[3]]
## # A tibble: 4 x 2
## JobInvolvement JobInvolvement_value
## <dbl> <fct>
## 1 1 Low
## 2 2 Medium
## 3 3 High
## 4 4 Very High
This can now be merged with the training dataset!
Iteratively join the dataframes within the definitions list with the main dataframe (training data)
This is new - study and learn!
Put train_raw_tbl into a list and call the element HRData - weird, huh!
append adds elements to a vector or list
purrr's reduce iteratively applies a user-specified function to successive binary sets of objects. A 3-element vector would have the function applied to the first 2 elements, and that output would then have the function applied with the 3rd element.
reduce(left_join) joins each definitions dataframe in turn - e.g. Education_value joined on the Education numerical key - making the data human readable!
Drop the original coded columns (e.g. Education) since we have just joined the more descriptive fields like Education_value; use one_of to remove them.
Rename columns like Education_value to a more useful nomenclature: set_names from base does this, leveraging str_replace_all to remove "_value".
data_merged_tbl <- list(HRData = train_raw_tbl) %>% append(definitions_list, after = 1) %>%
reduce(left_join) %>%
select(-one_of(names(definitions_list))) %>%
set_names(str_replace_all(names(.), pattern = "_value", replacement = "")) %>%
select(sort(names(.)))
glimpse(data_merged_tbl)
## Observations: 1,250
## Variables: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No",...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently",...
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <chr> "Sales", "Research & Development", "R...
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <fct> College, Below College, College, Mast...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Ot...
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction <fct> Medium, High, Very High, Very High, L...
## $ Gender <chr> "Female", "Male", "Male", "Female", "...
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <chr> "Sales Executive", "Research Scientis...
## $ JobSatisfaction <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus <chr> "Single", "Married", "Single", "Marri...
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No"...
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
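The reduce(left_join) idiom used above can be demonstrated on toy tables (the names here are illustrative):

```r
library(dplyr)
library(purrr)

hr <- tibble::tibble(EmployeeNumber = 1:3, Education = c(1, 2, 1))
edu_lookup <- tibble::tibble(Education = c(1, 2),
                             Education_value = factor(c("Below College", "College")))

# reduce() left-joins each lookup table into the main table in turn,
# matching on the shared Education column
merged <- list(HRData = hr, edu_lookup) %>% reduce(left_join)

merged$Education_value
```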
There are still some character variables:
## Observations: 1,250
## Variables: 9
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_R...
## $ Department <chr> "Sales", "Research & Development", "Research & ...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "Lif...
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", "Ma...
## $ JobRole <chr> "Sales Executive", "Research Scientist", "Labor...
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "Marr...
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "...
These need to be transformed into factors. Evaluate whether the features are ordered.
## # A tibble: 3 x 1
## BusinessTravel
## <chr>
## 1 Travel_Rarely
## 2 Travel_Frequently
## 3 Non-Travel
The natural order should be: Non-Travel –> Travel_Rarely –> Travel_Frequently. Change it!
Evaluate the levels of the factors:
# Reorder some factors
data_merged_tbl %>% mutate_if(is.character, as.factor) %>% select_if(is.factor) %>% map(levels) %>% head()
## $Attrition
## [1] "No" "Yes"
##
## $BusinessTravel
## [1] "Non-Travel" "Travel_Frequently" "Travel_Rarely"
##
## $Department
## [1] "Human Resources" "Research & Development"
## [3] "Sales"
##
## $Education
## [1] "Below College" "College" "Bachelor" "Master"
## [5] "Doctor"
##
## $EducationField
## [1] "Human Resources" "Life Sciences" "Marketing"
## [4] "Medical" "Other" "Technical Degree"
##
## $EnvironmentSatisfaction
## [1] "Low" "Medium" "High" "Very High"
BusinessTravel and MaritalStatus are the only factors that need to be re-leveled.
data_processed_tbl <- data_merged_tbl %>% mutate_if(is.character, as.factor) %>%
mutate(BusinessTravel = BusinessTravel %>% fct_relevel("Non-Travel", "Travel_Rarely", "Travel_Frequently"),
MaritalStatus = MaritalStatus %>% fct_relevel("Single", "Married", "Divorced"))
data_processed_tbl %>% select(BusinessTravel, MaritalStatus) %>% map(levels)
## $BusinessTravel
## [1] "Non-Travel" "Travel_Rarely" "Travel_Frequently"
##
## $MaritalStatus
## [1] "Single" "Married" "Divorced"
The pipeline below assumes the source files meet specific formats. It is likely that other company data will be different, so modify the code below as needed to fit your specific file formats.
process_hr_data_readable <- function(data, definitions_tbl){
definitions_list <- definitions_tbl %>% fill(X__1, .direction = "down") %>%
filter(!is.na(X__2)) %>%
separate(X__2, into = c("key", "value"), sep = " '", remove = TRUE) %>%
rename(column_name = X__1) %>%
mutate(key = as.numeric(key)) %>%
mutate(value = value %>% str_replace(pattern = "'", replacement = "")) %>%
split(.$column_name) %>%
map(~ select(., -column_name)) %>%
map(~ mutate(., value = as_factor(value)))
for(i in seq_along(definitions_list)){
list_name <- names(definitions_list)[i]
colnames(definitions_list[[i]]) <- c(list_name, paste0(list_name, "_value")) }
data_merged_tbl <- list(HRData = data) %>% append(definitions_list, after = 1) %>%
reduce(left_join) %>%
select(-one_of(names(definitions_list))) %>%
set_names(str_replace_all(names(.), pattern = "_value", replacement = "")) %>%
select(sort(names(.))) %>%
mutate_if(is.character, as.factor) %>%
mutate(BusinessTravel = BusinessTravel %>% fct_relevel("Non-Travel", "Travel_Rarely", "Travel_Frequently"),
MaritalStatus = MaritalStatus %>% fct_relevel("Single", "Married", "Divorced"))
return(data_merged_tbl)
}
This chapter is about preparing data for correlation analysis. Correlation analysis is a great way to determine if you are getting good features prior to modeling. Always start with a correlation analysis.
To inspect feature distributions and identify transformations needed to make them properly formatted for correlation analysis.
# Plot Faceted Histogram Function ----
plot_hist_facet <- function(data, bins = 10, ncol = 5, fct_reorder = FALSE, fct_rev = FALSE,
fill = palette()[[3]], color = "white", scale = "free"){
data_factored <- data %>% mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric) %>%
gather(key = key, value = value, factor_key = TRUE)
if(fct_reorder){data_factored <- data_factored %>% mutate(key = as.character(key) %>% as.factor())}
if(fct_rev){data_factored <- data_factored %>% mutate(key = fct_rev(key))}
g <- data_factored %>% ggplot(aes(x = value, group = key)) + geom_histogram(bins = bins,
fill = fill, color = color) + facet_wrap(~ key, ncol = ncol, scale = scale) + theme_tq()
return(g)}
Here is the output as an example:
Useful to make the label the first plot
The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. feature engineering).
In recipes, there are no constraints related to the order in which steps are added to the recipe. However, there are some general suggestions that you should consider:
Here is a suggested order of potential steps that should work for most problems:
For our data, here is the recipe plan we will implement:
We do not have missing values in the current data. However there are features with no variance.
recipes: Setup
First create a recipe from the original data and then specify the processing steps. Recipes can be created manually by sequentially adding roles to variables in a data set. If the analysis only required outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
The data contained in the data argument need not be the training set; this data is only used to catalog the names of the variables and their types (e.g. numeric, etc.).
recipes: Zero Variance Features
To add a step to remove zero variance features, it is simple to do:
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Operations:
##
## Zero variance filter on all_predictors()
prep() prepares the recipe performing preliminary calculations. Useful to see what the step is doing.
prep() does not transform the data - it just determines what is needed to perform the transformation.
bake() performs the transformation on new data (even if it is the same data you have already used). bake takes as its newdata argument the name of a dataset, typically the one used in the setup: recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl). Useful to apply the recipe to any data set (commonly test and train, after prep()).
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Training data contained 1250 data points and no missing data.
##
## Operations:
##
## Zero variance filter removed EmployeeCount, ... [trained]
It is helpful to know 3 features will be removed when the recipe is implemented.
recipes: Transformations
Identify which features have high skewness. You could use the facet plot illustrated above, or compute skewness in code (helpful when you have many features):
High Value –> fat tail on left | Low Value –> fat tail on right
skewness is a function from PerformanceAnalytics (the package is loaded with tidyquant) that computes the skewness of a univariate distribution.
factor_key = TRUE preserves the order of the variables.
## # A tibble: 19 x 2
## key value
## <fct> <dbl>
## 1 YearsSinceLastPromotion 1.94
## 2 YearsAtCompany 1.76
## 3 MonthlyIncome 1.35
## 4 TotalWorkingYears 1.11
## 5 NumCompaniesWorked 1.02
## 6 JobLevel 1.02
## 7 DistanceFromHome 0.985
## 8 StockOptionLevel 0.968
## 9 YearsInCurrentRole 0.933
## 10 PercentSalaryHike 0.802
## 11 YearsWithCurrManager 0.802
## 12 TrainingTimesLastYear 0.557
## 13 Age 0.422
## 14 EmployeeNumber 0.0121
## 15 MonthlyRate 0.0100
## 16 DailyRate 0.00578
## 17 HourlyRate -0.0263
## 18 EmployeeCount NaN
## 19 StandardHours NaN
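For intuition, sample skewness can be computed by hand; a minimal base-R sketch using the moment-based formula (the exact method may differ slightly from PerformanceAnalytics' default):

```r
# Moment-based skewness: mean cubed z-score (population sd in the denominator)
skewness_moment <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sqrt(sum((x - mean(x))^2) / n)
  sum(z^3) / n
}

right_skewed <- c(rep(0, 20), rep(1, 5), 10)  # long right tail
symmetric    <- c(-2, -1, 0, 1, 2)

skewness_moment(right_skewed)  # positive (fat tail on the right)
skewness_moment(symmetric)     # ~0
```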
Judgment is used to determine a cutoff for what is/is not skewed. Evaluating the values above, a sensible cutoff falls between YearsWithCurrManager = 0.802 and TrainingTimesLastYear = 0.557; the facet plot indicates this too. as.character is used so we can plot later.
skewed_feature_names <- train_readable_tbl %>% select_if(is.numeric) %>% map_df(skewness) %>%
gather(factor_key = TRUE) %>% arrange(desc(value)) %>% filter(value >= 0.8) %>%
pull(key) %>% as.character()
skewed_feature_names
## [1] "YearsSinceLastPromotion" "YearsAtCompany"
## [3] "MonthlyIncome" "TotalWorkingYears"
## [5] "NumCompaniesWorked" "JobLevel"
## [7] "DistanceFromHome" "StockOptionLevel"
## [9] "YearsInCurrentRole" "PercentSalaryHike"
## [11] "YearsWithCurrManager"
With this information, plot the features:
The plot shows that JobLevel and StockOptionLevel are factors and therefore are not transformed here (wait for dummying).
factor_names <- c("JobLevel", "StockOptionLevel")
skewed_feature_names <- train_readable_tbl %>% select_if(is.numeric) %>% map_df(skewness) %>%
gather(factor_key = TRUE) %>% arrange(desc(value)) %>% filter(value >= 0.8) %>%
filter(!key %in% factor_names) %>% pull(key) %>% as.character()
We now have the information we need to continue building our recipe:
recipe_obj <- recipe(Attrition ~., data = train_readable_tbl) %>% step_zv(all_predictors()) %>%
step_YeoJohnson(skewed_feature_names) %>%
step_num2factor(factor_names)
recipe_obj %>% prep()
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Training data contained 1250 data points and no missing data.
##
## Operations:
##
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]
step_YeoJohnson - a power transformation. Instead of a fixed square root, it searches for the best exponent between -5 and 5 to eliminate skew.
To make sure the recipe worked, facet plot the numeric features that were the target of the YeoJohnson transformation:
Skewness is significantly reduced!
recipes: center & scale
Required for algos that require scaling including kmeans, Deep Learning, PCA, SVMs.
Scaling does not typically hurt algos - so do it
Note the x-axis values differ from one another - the features have different ranges. Without scaling, DailyRate would likely dominate!
Must do in order: Center then Scale (C before S)
Note - H2O does not require centering and scaling - more on this later.
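The center-then-scale order can be checked against base R's scale(), which does both in one call; a quick sketch:

```r
x <- c(2, 4, 6, 8, 10)

centered <- x - mean(x)           # step_center analogue
standardized <- centered / sd(x)  # step_scale analogue (sd is unchanged by centering)

# scale() centers, then scales
all.equal(standardized, as.numeric(scale(x)))
```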
recipe_obj <- recipe(Attrition ~., data = train_readable_tbl) %>% step_zv(all_predictors()) %>%
step_YeoJohnson(skewed_feature_names) %>% step_num2factor(factor_names) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric())
recipe_obj
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Operations:
##
## Zero variance filter on all_predictors()
## Yeo-Johnson transformation on skewed_feature_names
## Factor variables from factor_names
## Centering for all_numeric()
## Scaling for all_numeric()
The output from the recipe object is helpful but we are interested in understanding what it is really doing behind the scenes:
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Training data contained 1250 data points and no missing data.
##
## Operations:
##
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]
## Centering for Age, DailyRate, ... [trained]
## Scaling for Age, DailyRate, ... [trained]
All the x-axis scales are updated for the independent features!
recipes: dummy variables
One hot encoding - expands categorical variables into separate columns of 1/0. Important to detect patterns in unordered data.
all_nominal() selects only categorical/factor data.
To see what output we will get, evaluate only JobRole as our example:
dummied_recipe_obj <- recipe(Attrition ~., data = train_readable_tbl) %>% step_zv(all_predictors()) %>% step_YeoJohnson(skewed_feature_names) %>%
step_num2factor(factor_names) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) %>%
step_dummy(all_nominal()) %>% prep() %>% bake(newdata = train_readable_tbl) %>% select(contains("JobRole")) %>% plot_hist_facet(ncol = 3)
JobRole has been expanded to 8 columns (1 less than the number of unique values in JobRole; the first level, Healthcare Representative, was left out).
recipes::step_dummy was updated with a new option, one_hot: A logical. For C levels, should C dummy variables be created rather than C-1? Very useful for plotting!
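A minimal sketch of the one_hot usage (an illustration only - the recipe actually used below keeps the default C-1 behavior):

```r
# Sketch: one_hot = TRUE keeps a column for every level of each factor
recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_dummy(all_nominal(), one_hot = TRUE)
```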
recipe_obj <- recipe(Attrition ~., data = train_readable_tbl) %>% step_zv(all_predictors()) %>% step_YeoJohnson(skewed_feature_names) %>%
    step_num2factor(factor_names) %>% step_center(all_numeric()) %>% step_scale(all_numeric()) %>%
    step_dummy(all_nominal())
recipes: Final Recipe
Transform the data... but first review what the recipe will do:
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 34
##
## Training data contained 1250 data points and no missing data.
##
## Operations:
##
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]
## Centering for Age, DailyRate, ... [trained]
## Scaling for Age, DailyRate, ... [trained]
## Dummy variables from BusinessTravel, Department, ... [trained]
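The trained recipe is applied with prep() and bake(). A minimal sketch (test_readable_tbl is assumed from earlier chunks; this era of recipes used the newdata argument in bake()):

```r
# Sketch: estimate the transformation parameters, then apply them to both splits
prepared_recipe <- recipe_obj %>% prep()
train_tbl <- bake(prepared_recipe, newdata = train_readable_tbl)
test_tbl  <- bake(prepared_recipe, newdata = test_readable_tbl)
```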
All of this effort is so we can perform a correlation analysis.
Note that the recipe changes our target label:
## [1] "numeric"
Note: Good introduction found here. May also help expand this to include non-linear relationships.
Correlation analysis only identifies linear relationships - if there is an exponential relationship, it will be missed. (A random forest will find the non-linear relationship.)
If there are no features with relatively strong correlation, it might be wise to collect more data.
Correlation is a good barometer but not necessarily a definitive answer - it is a guide.
A function to take data and measure correlation.
pairwise.complete.obs is not the default but it is what you nearly always want to use
quo_name changes the stored value from a quosure to the string “Attrition_Yes”.
mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric) prevents the function from failing if there is character data.
cor from stats returns a square correlation df correlating every feature against all the others. The return is square, meaning the number of rows = number of columns.
use is an optional character string giving a method for computing covariances in the presence of missing values.
(.) in the names function simply represents the data piped into the function. mutate(feature = names(.)) creates a new character column called feature.
select(feature, !! feature_expr) produces a tibble with 2 columns - the new feature column and the Attrition_Yes values.
filter(!(feature == feature_name)) removes the row where feature equals Attrition_Yes - it is simply redundant.
as_factor maintains the order in which the values appear - not alphabetized.
fct_reorder reorders a factor by another column - in this case, by the Attrition_Yes value. Helpful! Requires arrange() to update the df.
fct_rev reverses the ordering from the previous line of code.
fct_reorder and fct_rev are useful for plotting.
get_cor <- function(data, target, use = "pairwise.complete.obs", fct_reorder = FALSE, fct_rev = FALSE){
feature_expr <- enquo(target)
feature_name <- quo_name(feature_expr)
data_cor <- data %>% mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric) %>%
cor(use = use) %>% as.tibble() %>% mutate(feature = names(.)) %>%
select(feature, !! feature_expr) %>% filter(!(feature == feature_name)) %>%
mutate_if(is.character, as_factor)
if(fct_reorder){data_cor <- data_cor %>% mutate(feature = fct_reorder(feature,
!! feature_expr)) %>% arrange(feature)}
if(fct_rev){data_cor <- data_cor %>% mutate(feature = fct_rev(feature)) %>% arrange(feature)}
    return(data_cor)}
Example of get_cor()
## Warning in stats::cor(x, ...): the standard deviation is zero
## # A tibble: 65 x 2
## feature Attrition_Yes
## <fct> <dbl>
## 1 PerformanceRating_Good NA
## 2 OverTime_Yes 0.240
## 3 JobRole_Sales.Representative 0.153
## 4 BusinessTravel_Travel_Frequently 0.103
## 5 Department_Sales 0.0891
## 6 DistanceFromHome 0.0822
## 7 JobRole_Laboratory.Technician 0.0737
## 8 EducationField_Technical.Degree 0.0731
## 9 JobRole_Human.Resources 0.0718
## 10 JobInvolvement_Medium 0.0604
## # ... with 55 more rows
Multicollinearity is when two features in a model are highly correlated, which means the coefficients are no longer uniquely identified. This affects the explanatory power of the model.
The goals are different for machine learning and traditional statistics, which is why multicollinearity is not traditionally checked for in ML but is in statistics.
Machine Learning: Assessing multicollinearity is not typically a concern in machine learning, where the goal is a high performance prediction and more features help in this pursuit. ML models can be linear or nonlinear and typically have regularization, which fights overfitting - a situation that can be caused by many correlated features. This enables extreme accuracy, but the feature weights (coefficients) get lost, which reduces the explainability of the model.
In machine learning, we will have hundreds of features, but a few of the features will likely dominate for any given observation. This is where LIME comes in. LIME is a tool that allows us to cut through to the core of the features that are driving the explanation for any given model. It approximates nonlinear models locally, which enables us to detect which features are driving the prediction. It’s not impervious to collinear features, but this is why we need to understand what features are going into a model from a business-intuition standpoint and discuss features with Subject Matter Experts. See this resource for more discussion on the LIME topic: https://github.com/marcotcr/lime/issues/178
Traditional Statistics: In traditional statistics it’s very common to inspect for multicollinearity and remove features. This is because the goal is to build a model using few features that explain the majority of the variance. Assessing multicollinearity tells us which redundant features can be removed.
The Debate: Because of the two sides, it’s debatable and it depends on your goals of high performance vs ability to explain with as few features as possible.
My Opinion: When it comes to modeling, my opinion is that we should strive for a high performing model once we know we have good features (features that have some predictive power based on a correlation assessment). When it comes to selecting features, we should meet with Subject Matter Experts and pick reasonable features to build a model around, doing our best to not include the same feature twice, which is often where multicollinearity arises (e.g. including birth year and age in the same model). When it comes to explaining, we should use tools like LIME that have the ability to hone in on the features that are contributing most to the prediction.
For more on this topic: https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning
The data produced by get_cor() is terrific, but understanding the data quickly is accomplished by creating a visualization.
The function below produces an informative visualization of the get_cor() output. Do not be overwhelmed by the function. Many of the parameters are simply ggplot parameters.
Notes on plot_cor():
- It builds on get_cor().
- feature_name_text is simply the Attrition_Yes value rounded to the number of digits specified by lbl_precision.
- Correlation is created based on the value of Attrition_Yes and outputs a factor: Positive or Negative.
- Step through the ggplot code to understand how it works. (First set values for the function parameters.)
plot_cor <- function(data, target, fct_reorder = FALSE, fct_rev = FALSE, include_lbl = TRUE, lbl_precision = 2, lbl_position = "outward",
    size = 2, line_size = 1, vert_size = 1, color_pos = palette_light()[[1]], color_neg = palette_light()[[2]]) {
feature_expr <- enquo(target)
feature_name <- quo_name(feature_expr)
data_cor <- data %>%
get_cor(!! feature_expr, fct_reorder = fct_reorder, fct_rev = fct_rev) %>%
mutate(feature_name_text = round(!! feature_expr, lbl_precision)) %>%
mutate(Correlation = case_when((!! feature_expr) >= 0 ~ "Positive", TRUE ~ "Negative") %>% as.factor())
g <- data_cor %>% ggplot(aes_string(x = feature_name, y = "feature", group = "feature")) +
geom_point(aes(color = Correlation), size = size) +
geom_segment(aes(xend = 0, yend = feature, color = Correlation), size = line_size) +
geom_vline(xintercept = 0, color = palette_light()[[1]], size = vert_size) +
expand_limits(x = c(-1, 1)) + theme_tq() + scale_color_manual(values = c(color_neg, color_pos))
if (include_lbl) g <- g + geom_label(aes(label = feature_name_text), hjust = lbl_position)
return(g)}
train_tbl %>% plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
plot_cor Output
Data have been prepared and we developed a visualization function to easily evaluate feature correlations.
train_tbl %>% select(Attrition_Yes, Age, contains("Gender"), contains("MaritalStatus"), NumCompaniesWorked, contains("Over18"), DistanceFromHome) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
Observations:
train_tbl %>% select(Attrition_Yes, contains("employee"), contains("department"), contains("job")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
Observations:
train_tbl %>% select(Attrition_Yes, contains("income"), contains("rate"), contains("salary"),
contains("stock")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
Observations:
- MonthlyIncome and StockOptionLevel are impactful
train_tbl %>% select(Attrition_Yes, contains("satisfaction"), contains("life")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
train_tbl %>% select(Attrition_Yes, contains("performance"), contains("involvement")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
train_tbl %>% select(Attrition_Yes, contains("overtime"), contains("travel")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
train_tbl %>% select(Attrition_Yes, contains("training"), contains("education")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
train_tbl %>% select(Attrition_Yes, contains("years")) %>%
    plot_cor(target = Attrition_Yes, fct_reorder = T, fct_rev = F)
A few random H2O notes:
H2O has great documentation.
Update Since Original Modeling The leaderboard_frame is a legacy argument that is not necessary and the data is better served being used for the training and/or validation. Therefore, you should only perform one split (training and validation) as opposed to two (training, validation, test - used for leaderboard_frame). This should increase model performance on the data set. You can try it both ways to see what happens. The leaderboard rankings performed later will be made on the Cross Validation metrics.
The main issue is that using a leaderboard frame reduces the amount of training data unnecessarily. The leaderboard frame used to be needed for creating the Stacked Ensemble models; it is now legacy because cross validation is used as the ranking mechanism. If we split off a set for the leaderboard, we lose data that could have been used for training. Instead of 70% training, 15% validation (test), and 15% leaderboard (test), we can combine the leaderboard and training sets to get a higher accuracy model and rank by cross validation, which is performed on the training set internally within h2o.automl(). The split should be 85% training and 15% validation. Leave leaderboard_frame = NULL, which is the preferred approach now.
h2o.init()
h2o.no_progress()
split_h2o <- h2o.splitFrame(as.h2o(train_tbl), ratios = c(0.85), seed = 1234)
train_h2o <- split_h2o[[1]]; valid_h2o <- split_h2o[[2]]
test_h2o <- as.h2o(test_tbl)
y <- "Attrition"; x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(x = x, y = y, training_frame = train_h2o, validation_frame = valid_h2o,
    leaderboard_frame = NULL, max_runtime_secs = 30, nfolds = 5)
There is much data produced by the individual models - the leader is shown below:
## Model Details:
## ==============
##
## H2OBinomialModel: glm
## Model ID: GLM_grid_0_AutoML_20181028_212737_model_0
## GLM Model: summary
## family link regularization
## 1 binomial logit Ridge ( lambda = 0.0042 )
## lambda_search
## 1 nlambda = 30, lambda.max = 6.2463, lambda.min = 0.0042, lambda.1se = 0.02055
## number_of_predictors_total number_of_active_predictors
## 1 79 79
## number_of_iterations training_frame
## 1 54 _804567d077a2a0701a19fb84ffc740ad
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 0.849792 -1.417281
## 2 JobRole.Healthcare Representative 0.066056 0.066056
## 3 JobRole.Human Resources 0.167535 0.167535
## 4 JobRole.Laboratory Technician 0.294450 0.294450
## 5 JobRole.Manager -0.188477 -0.188477
##
## ---
## names coefficients standardized_coefficients
## 75 TotalWorkingYears -0.042385 -0.330973
## 76 TrainingTimesLastYear -0.180422 -0.231657
## 77 YearsAtCompany 0.102775 0.649596
## 78 YearsInCurrentRole -0.107887 -0.399926
## 79 YearsSinceLastPromotion 0.098246 0.321247
## 80 YearsWithCurrManager -0.118166 -0.425507
##
## H2OBinomialMetrics: glm
## ** Reported on training data. **
##
## MSE: 0.077
## RMSE: 0.28
## LogLoss: 0.27
## Mean Per-Class Error: 0.19
## AUC: 0.9
## Gini: 0.8
## R^2: 0.44
## Residual Deviance: 573
## AIC: 733
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 838 51 0.057368 =51/889
## Yes 58 118 0.329545 =58/176
## Totals 896 169 0.102347 =109/1065
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.348528 0.684058 128
## 2 max f2 0.173866 0.739300 220
## 3 max f0point5 0.517226 0.767857 79
## 4 max accuracy 0.517226 0.906103 79
## 5 max precision 0.962484 1.000000 0
## 6 max recall 0.006173 1.000000 388
## 7 max specificity 0.962484 1.000000 0
## 8 max absolute_mcc 0.348528 0.623206 128
## 9 max min_per_class_accuracy 0.189074 0.825647 210
## 10 max mean_per_class_accuracy 0.218153 0.836509 190
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
##
## MSE: 0.086
## RMSE: 0.29
## LogLoss: 0.3
## Mean Per-Class Error: 0.24
## AUC: 0.81
## Gini: 0.62
## R^2: 0.26
## Residual Deviance: 112
## AIC: 272
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 148 12 0.075000 =12/160
## Yes 10 15 0.400000 =10/25
## Totals 158 27 0.118919 =22/185
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.314235 0.576923 26
## 2 max f2 0.167744 0.620915 52
## 3 max f0point5 0.648042 0.612245 5
## 4 max accuracy 0.648042 0.897297 5
## 5 max precision 0.933029 1.000000 0
## 6 max recall 0.010212 1.000000 158
## 7 max specificity 0.933029 1.000000 0
## 8 max absolute_mcc 0.314235 0.508369 26
## 9 max min_per_class_accuracy 0.167744 0.760000 52
## 10 max mean_per_class_accuracy 0.167744 0.773750 52
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.094
## RMSE: 0.31
## LogLoss: 0.32
## Mean Per-Class Error: 0.24
## AUC: 0.85
## Gini: 0.69
## R^2: 0.32
## Residual Deviance: 684
## AIC: 844
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 799 90 0.101237 =90/889
## Yes 65 111 0.369318 =65/176
## Totals 864 201 0.145540 =155/1065
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.303861 0.588859 155
## 2 max f2 0.192745 0.659068 208
## 3 max f0point5 0.481045 0.651408 86
## 4 max accuracy 0.481045 0.881690 86
## 5 max precision 0.949434 1.000000 0
## 6 max recall 0.002279 1.000000 396
## 7 max specificity 0.949434 1.000000 0
## 8 max absolute_mcc 0.359402 0.511298 127
## 9 max min_per_class_accuracy 0.173231 0.772727 219
## 10 max mean_per_class_accuracy 0.192745 0.781103 208
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid
## accuracy 0.87323946 0.02347418 0.84976524 0.82159626 0.90140843
## auc 0.8425771 0.026760576 0.8296127 0.8300225 0.9127907
## err 0.12676056 0.02347418 0.15023474 0.17840375 0.09859155
## err_count 27.0 5.0 32.0 38.0 21.0
## f0point5 0.63031274 0.07437051 0.6377551 0.45064378 0.7763975
## cv_4_valid cv_5_valid
## accuracy 0.91079813 0.8826291
## auc 0.7989362 0.84152335
## err 0.089201875 0.11737089
## err_count 19.0 25.0
## f0point5 0.6179775 0.6687898
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## precision 0.6580986 0.096957624 0.65789473 0.4117647 0.8333333
## r2 0.30837095 0.061606336 0.2931284 0.25788376 0.4637525
## recall 0.5819287 0.06444905 0.5681818 0.7241379 0.6097561
## residual_deviance 135.58003 11.968308 165.20996 127.08945 117.54124
## rmse 0.30468094 0.014051158 0.34037712 0.2954366 0.28870824
## specificity 0.9306463 0.035521794 0.9230769 0.8369565 0.9709302
## cv_4_valid cv_5_valid
## precision 0.6875 0.7
## r2 0.20434217 0.32274798
## recall 0.44 0.5675676
## residual_deviance 125.3069 142.75262
## rmse 0.28709954 0.31178322
## specificity 0.9734042 0.9488636
H2O even provides a leaderboard - how cool is that?
## model_id auc logloss
## 1 GLM_grid_0_AutoML_20181028_212737_model_0 0.85 0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84 0.32
## 3 StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84 0.32
## 4 GBM_grid_0_AutoML_20181028_212737_model_4 0.82 0.34
## 5 GBM_grid_0_AutoML_20181028_212737_model_5 0.80 0.36
## 6 GBM_grid_0_AutoML_20181028_212737_model_1 0.80 0.36
## mean_per_class_error rmse mse
## 1 0.24 0.31 0.094
## 2 0.23 0.31 0.093
## 3 0.24 0.30 0.093
## 4 0.25 0.32 0.103
## 5 0.28 0.33 0.109
## 6 0.27 0.33 0.107
##
## [21 rows x 6 columns]
Getting the leader is pretty simple:
(The leader's printout is identical to the model details shown above.)
To evaluate a model other than the leader, you must use the model_id in the leaderboard:
The code above is commented only because when the RMD is knitted, different models will be built and the model_id will change.
extract_h2o_model_name_by_position <- function(h2o_leaderboard, position = 1, verbose = TRUE){
model_name <- h2o_leaderboard %>% as.tibble() %>% slice(position) %>% pull(model_id)
if(verbose) message(model_name)
    return(model_name)}
# Test new function
extract_h2o_model_name_by_position(automl_models_h2o@leaderboard, position = 3, verbose = T) %>% h2o.getModel()
## StackedEnsemble_AllModels_0_AutoML_20181028_212737
## Model Details:
## ==============
##
## H2OBinomialModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_0_AutoML_20181028_212737
## NULL
##
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.051
## RMSE: 0.23
## LogLoss: 0.19
## Mean Per-Class Error: 0.13
## AUC: 0.96
## Gini: 0.93
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 863 26 0.029246 =26/889
## Yes 39 137 0.221591 =39/176
## Totals 902 163 0.061033 =65/1065
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.372597 0.808260 140
## 2 max f2 0.230229 0.826039 179
## 3 max f0point5 0.621124 0.860927 90
## 4 max accuracy 0.419274 0.938967 132
## 5 max precision 0.983435 1.000000 0
## 6 max recall 0.058457 1.000000 336
## 7 max specificity 0.983435 1.000000 0
## 8 max absolute_mcc 0.372597 0.772834 140
## 9 max min_per_class_accuracy 0.168999 0.889764 210
## 10 max mean_per_class_accuracy 0.230229 0.895794 179
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on validation data. **
##
## MSE: 0.083
## RMSE: 0.29
## LogLoss: 0.29
## Mean Per-Class Error: 0.21
## AUC: 0.82
## Gini: 0.65
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 143 17 0.106250 =17/160
## Yes 8 17 0.320000 =8/25
## Totals 151 34 0.135135 =25/185
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.225906 0.576271 33
## 2 max f2 0.092863 0.644172 62
## 3 max f0point5 0.850666 0.612245 5
## 4 max accuracy 0.850666 0.897297 5
## 5 max precision 0.982857 1.000000 0
## 6 max recall 0.041520 1.000000 146
## 7 max specificity 0.982857 1.000000 0
## 8 max absolute_mcc 0.225906 0.506436 33
## 9 max min_per_class_accuracy 0.105688 0.760000 55
## 10 max mean_per_class_accuracy 0.092863 0.788750 62
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.093
## RMSE: 0.3
## LogLoss: 0.32
## Mean Per-Class Error: 0.24
## AUC: 0.84
## Gini: 0.68
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 807 82 0.092238 =82/889
## Yes 67 109 0.380682 =67/176
## Totals 874 191 0.139906 =149/1065
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.278936 0.594005 157
## 2 max f2 0.150185 0.645325 219
## 3 max f0point5 0.505251 0.667808 88
## 4 max accuracy 0.505251 0.885446 88
## 5 max precision 0.982169 1.000000 0
## 6 max recall 0.033742 1.000000 393
## 7 max specificity 0.982169 1.000000 0
## 8 max absolute_mcc 0.437944 0.526086 102
## 9 max min_per_class_accuracy 0.113342 0.755682 253
## 10 max mean_per_class_accuracy 0.150185 0.774744 219
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Saving models is simple. First use h2o.getModel then save it by running h2o.saveModel. Easy! Below the code is not evaluated because each time the RMD is knitted, the models will change.
h2o.getModel("GLM_grid_0_AutoML_20181028_112225_model_0") %>%
h2o.saveModel(path = "04_Modeling/h2o_models/")
h2o.getModel("StackedEnsemble_BestOfFamily_0_AutoML_20181028_112225") %>%
    h2o.saveModel(path = "04_Modeling/h2o_models/")
Loading a saved model is equally simple using h2o.loadModel:
getModel returns a reference to an existing model in the H2O instance. loadModel loads a saved H2O model from disk.
Load a model:
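A sketch of loading (h2o.saveModel writes a file named after the model_id, so the path below reuses one of the IDs saved above):

```r
# Sketch: load a previously saved model back into the H2O cluster
stacked_ensemble_h2o <- h2o.loadModel(
    "04_Modeling/h2o_models/StackedEnsemble_BestOfFamily_0_AutoML_20181028_112225")
```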
Make predictions with the model using test_tbl and h2o.predict - again, simple!
predictions <- h2o.predict(stacked_ensemble_h2o, newdata = as.h2o(test_tbl))
predictions_tbl <- predictions %>% as.tibble()
predictions_tbl %>% head()
## # A tibble: 6 x 3
## predict No Yes
## <fct> <dbl> <dbl>
## 1 No 0.981 0.0195
## 2 No 0.803 0.197
## 3 No 0.999 0.000884
## 4 No 0.974 0.0260
## 5 Yes 0.338 0.662
## 6 No 0.923 0.0766
For H2O binary classification, h2o.predict provides 3 columns of data: predict (the predicted class), plus No and Yes (the class probabilities for each level).
The models are trained on the training data; the results are then checked against the validation data set, and the models learn from the differences.
Repeated Note: The leaderboard_frame is a legacy argument that is not necessary and the data is better served being used for the training and/or validation. Therefore, you should only perform one split (training and validation) as opposed to two (training, validation, test - used for leaderboard_frame).
When extracting the model parameters, there is sufficient information to run the model outside of the H2O framework. The H2O automated ML leaderboard returns a bunch of models. Each model can be extracted and its contents opened up to determine the exact model parameters. The only one that will be difficult to recreate is the stacked ensemble, since it is a combination of models. Others such as GLM, GBM, and Deep Learning can be inspected and recreated. See the Bonus Lecture on Grid Search.
Use the @parameters option with the model:
## $model_id
## [1] "GLM_grid_0_AutoML_20181028_212737_model_0"
##
## $training_frame
## [1] "_804567d077a2a0701a19fb84ffc740ad"
##
## $validation_frame
## [1] "RTMP_sid_bdf4_5"
##
## $nfolds
## [1] 5
##
## $seed
## [1] -5728624559962048512
##
## $keep_cross_validation_predictions
## [1] TRUE
##
## $fold_assignment
## [1] "Modulo"
##
## $family
## [1] "binomial"
##
## $solver
## [1] "COORDINATE_DESCENT"
##
## $alpha
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
##
## $lambda
## [1] 6.2463 4.5467 3.3095 2.4090 1.7535 1.2763 0.9290 0.6762 0.4922 0.3583
## [11] 0.2608 0.1898 0.1382 0.1006 0.0732 0.0533 0.0388 0.0282 0.0206 0.0150
## [21] 0.0109 0.0079 0.0058 0.0042 0.0031 0.0022 0.0016
##
## $lambda_search
## [1] TRUE
##
## $nlambdas
## [1] 30
##
## $max_iterations
## [1] 300
##
## $objective_epsilon
## [1] 0.0001
##
## $gradient_epsilon
## [1] 0.000001
##
## $link
## [1] "logit"
##
## $lambda_min_ratio
## [1] 0.0001
##
## $max_active_predictors
## [1] 5000
##
## $obj_reg
## [1] 0.00094
##
## $max_runtime_secs
## [1] 4.3
##
## $x
## [1] "JobRole" "EducationField"
## [3] "JobLevel" "Education"
## [5] "EnvironmentSatisfaction" "JobInvolvement"
## [7] "JobSatisfaction" "RelationshipSatisfaction"
## [9] "StockOptionLevel" "WorkLifeBalance"
## [11] "Department" "BusinessTravel"
## [13] "MaritalStatus" "Gender"
## [15] "OverTime" "PerformanceRating"
## [17] "Age" "DailyRate"
## [19] "DistanceFromHome" "EmployeeNumber"
## [21] "HourlyRate" "MonthlyIncome"
## [23] "MonthlyRate" "NumCompaniesWorked"
## [25] "PercentSalaryHike" "TotalWorkingYears"
## [27] "TrainingTimesLastYear" "YearsAtCompany"
## [29] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [31] "YearsWithCurrManager"
##
## $y
## [1] "Attrition"
You can then select a parameter of interest (here, the nfolds element):
## [1] 5
H2O provides complete access to the model parameters so you can recreate the model or tweak model parameters easily.
nfolds is K-fold validation. K-fold validation is used in model development to determine what parameters affect prediction performance by performing multiple sub-tests rather than just one test. The end result is parameter selections based on model performance stability that generalizes better with new data.
For example, for nfolds=5, 6 models are built. The first 5 models (cross-validation models) are built on 80% of the training data, and a different 20% is held out for each of the 5 models. Then the main model is built on 100% of the training data. This main model is the model you get back from H2O in R, Python and Flow (though the CV models are also stored and available to access later).
All 5 cross-validation models contain training metrics (from the 80% training data) and validation metrics (from their 20% holdout/validation data). To compute their individual validation metrics, each of the 5 cross-validation models had to make predictions on its 20% of the rows of the original training frame and score against the true labels of the 20% holdout.
For the main model, this is how the cross-validation metrics are computed: The 5 holdout predictions are combined into one prediction for the full training dataset (i.e., predictions for every row of the training data, but the model making the prediction for a particular row has not seen that row during training). This “holdout prediction” is then scored against the true labels, and the overall cross-validation metrics are computed.
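The fold splitting described above, with the Modulo fold_assignment shown in the parameter list, can be sketched in a few lines of base R. This is an illustration of the scheme only, not H2O's internal code:

```r
# Sketch of Modulo fold assignment for nfolds = 5 (illustration only)
nfolds <- 5
n      <- 20                          # pretend the training frame has 20 rows
fold   <- (seq_len(n) - 1) %% nfolds  # row i is assigned to fold (i - 1) mod 5

# Each fold is held out once; the remaining 80% trains that CV model
holdout_rows <- which(fold == 0)      # 20% held out for CV model 1
train_rows   <- which(fold != 0)      # 80% used to train CV model 1
table(fold)                           # folds are equal-sized here
```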
Each CV model produces a prediction frame pertaining to its fold. It can be saved and probed from the various clients if the keep_cross_validation_predictions parameter is set in the model constructor.
These holdout predictions have some interesting properties. First they have names like:
prediction_GBM_model_1452035702801_1_cv_1
and they contain, unsurprisingly, predictions for the data held out in the fold. They also have the same number of rows as the entire input training frame, with 0s filled in for all rows that are not in the holdout.
Retrieve the AUC for a classifier with the xval argument, which returns the average cross-validation AUC:
## train valid xval
## 0.90 0.81 0.85
H2O supports two types of grid search – traditional (or “cartesian”) grid search and random grid search. In a cartesian grid search, users specify a set of values for each hyperparameter that they want to search over, and H2O will train a model for every combination of the hyperparameter values. This means that if you have three hyperparameters and you specify 5, 10 and 2 values for each, your grid will contain a total of 5 × 10 × 2 = 100 models.
In random grid search, the user specifies the hyperparameter space in the exact same way, except H2O will sample uniformly from the set of all possible hyperparameter value combinations. In random grid search, the user also specifies a stopping criterion, which controls when the random grid search is completed. The user can tell the random grid search to stop by specifying a maximum number of models or the maximum number of seconds allowed for the search. The user may also specify a performance-metric-based stopping criterion, which will stop the random grid search when the performance stops improving by a specified amount.
Once the grid search is complete, the user can query the grid object and sort the models by a particular performance metric (for example, “AUC”). All models are stored in the H2O cluster and are accessible by model id.
Grid Search
Note the use of h2o.grid and hyper_params
# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1), max_depth = c(3, 5, 9), sample_rate = c(0.8, 1.0),
col_sample_rate = c(0.2, 0.5, 1.0))
# Train and validate a cartesian grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y, grid_id = "gbm_grid1", training_frame = train,
validation_frame = valid, ntrees = 100, seed = 1,
hyper_params = gbm_params1)
# Get the grid results, sorted by validation AUC
gbm_gridperf1 <- h2o.getGrid(grid_id = "gbm_grid1", sort_by = "auc", decreasing = TRUE)
print(gbm_gridperf1)
# Grab the top GBM model, chosen by validation AUC
best_gbm1 <- h2o.getModel(gbm_gridperf1@model_ids[[1]])
# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_gbm_perf1 <- h2o.performance(model = best_gbm1,
newdata = test)
h2o.auc(best_gbm_perf1) # 0.7781932
# Look at the hyperparameters for the best model
print(best_gbm1@model[["model_summary"]])
Random Grid Search
Note the use of seq, search_criteria: search_criteria <- list(strategy = "RandomDiscrete", max_models = 36, seed = 1) and hyper_parameters
gbm_params2 <- list(learn_rate = seq(0.01, 0.1, 0.01), max_depth = seq(2, 10, 1),
                    sample_rate = seq(0.5, 1.0, 0.1), col_sample_rate = seq(0.1, 1.0, 0.1))
search_criteria <- list(strategy = "RandomDiscrete", max_models = 36, seed = 1)
# Train and validate a random grid of GBMs
gbm_grid2 <- h2o.grid("gbm", x = x, y = y, grid_id = "gbm_grid2", training_frame = train,
                      validation_frame = valid, ntrees = 100, seed = 1,
                      hyper_params = gbm_params2, search_criteria = search_criteria)
gbm_gridperf2 <- h2o.getGrid(grid_id = "gbm_grid2", sort_by = "auc", decreasing = TRUE)
print(gbm_gridperf2)
# Grab the top GBM model, chosen by validation AUC
best_gbm2 <- h2o.getModel(gbm_gridperf2@model_ids[[1]])
# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_gbm_perf2 <- h2o.performance(model = best_gbm2,
newdata = test)
h2o.auc(best_gbm_perf2) # 0.7811332
# Look at the hyperparameters for the best model
print(best_gbm2@model[["model_summary"]])
Specify the RandomDiscrete strategy to perform a random search of all the combinations of your hyperparameters. RandomDiscrete should be combined with at least one early stopping criterion: max_models and/or max_runtime_secs.
Examples:
list(strategy = "RandomDiscrete", max_models = 10, seed = 1)
list(strategy = "RandomDiscrete", max_runtime_secs = 3600)
list(strategy = "RandomDiscrete", max_models = 42, max_runtime_secs = 28800)
list(strategy = "RandomDiscrete", stopping_tolerance = 0.001, stopping_rounds = 10)
list(strategy = "RandomDiscrete",
stopping_metric = "misclassification", stopping_tolerance = 0.0005, stopping_rounds = 5)
To learn more about this topic, visit this tutorial.
H2O deep learning models set nfolds = 0 by default!
Develop code to improve the output of the H2O leaderboard:
## model_id auc logloss
## 1 GLM_grid_0_AutoML_20181028_212737_model_0 0.85 0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84 0.32
## 3 StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84 0.32
## 4 GBM_grid_0_AutoML_20181028_212737_model_4 0.82 0.34
## 5 GBM_grid_0_AutoML_20181028_212737_model_5 0.80 0.36
## 6 GBM_grid_0_AutoML_20181028_212737_model_1 0.80 0.36
## mean_per_class_error rmse mse
## 1 0.24 0.31 0.094
## 2 0.23 0.31 0.093
## 3 0.24 0.30 0.093
## 4 0.25 0.32 0.103
## 5 0.28 0.33 0.109
## 6 0.27 0.33 0.107
##
## [21 rows x 6 columns]
AUC is a way of measuring the performance of a binary classifier by comparing the false positive rate to the true positive rate. Logloss, or logarithmic loss, measures the performance of a classifier by comparing the class probability to the actual value (1 or 0).
The AUC and Logloss metrics may suggest different models.
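AUC can also be read as the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal base R sketch with made-up scores (the values are hypothetical, not taken from the models above):

```r
# AUC as P(score of a random positive > score of a random negative)
pos <- c(0.9, 0.8, 0.4)        # predicted probabilities for actual "Yes"
neg <- c(0.7, 0.3, 0.2, 0.1)   # predicted probabilities for actual "No"

pairs <- expand.grid(p = pos, n = neg)  # every positive/negative pair
auc   <- mean(pairs$p > pairs$n)        # fraction of pairs ranked correctly (no ties here)
auc                                     # 11 of 12 pairs ranked correctly
```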
- str_split with simplify = TRUE makes the output a matrix rather than a list, which makes it easy to extract data with [, 1]. This lets us get the type of model from each model ID in the leaderboard.
- rownames_to_column adds the row names of a data frame as a character column. This is very helpful here because it records the order in which the models appear in the leaderboard.
- as_factor vs. as.factor: as_factor lets us order the data by AUC; recall that as.factor would order the levels alphabetically.
- gather reshapes the data to long format to take advantage of ggplot color and group aesthetics. Long format stacks the faceting values on top of each other, making the data frame long rather than wide. The key column will hold the names auc and logloss, and the value column will hold their values; gather transposes auc and logloss from two columns into key-value pairs.

Before:
automl_models_h2o@leaderboard %>% as.tibble() %>%
mutate(model_type = str_split(model_id, "_", simplify = T)[,1]) %>% slice(1:10) %>%
rownames_to_column() %>%
mutate(model_id = as_factor(model_id) %>% reorder(auc), model_type = as.factor(model_type))
## # A tibble: 10 x 8
## rowname model_id auc logloss mean_per_class_~ rmse mse model_type
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 1 GLM_gri~ 0.846 0.321 0.235 0.307 0.0940 GLM
## 2 2 Stacked~ 0.840 0.316 0.232 0.306 0.0934 StackedEn~
## 3 3 Stacked~ 0.839 0.316 0.236 0.305 0.0929 StackedEn~
## 4 4 GBM_gri~ 0.816 0.345 0.254 0.320 0.103 GBM
## 5 5 GBM_gri~ 0.800 0.359 0.285 0.330 0.109 GBM
## 6 6 GBM_gri~ 0.797 0.362 0.273 0.327 0.107 GBM
## 7 7 GBM_gri~ 0.795 0.351 0.259 0.322 0.104 GBM
## 8 8 GBM_gri~ 0.783 0.372 0.293 0.332 0.110 GBM
## 9 9 GBM_gri~ 0.783 0.379 0.288 0.337 0.113 GBM
## 10 10 GBM_gri~ 0.781 0.405 0.297 0.353 0.125 GBM
automl_models_h2o@leaderboard %>% as.tibble() %>%
mutate(model_type = str_split(model_id, "_", simplify = T)[,1]) %>% slice(1:10) %>%
rownames_to_column() %>%
mutate(model_id = as_factor(model_id) %>% reorder(auc), model_type = as.factor(model_type)) %>%
gather(key = key, value = value, -c(model_id, model_type, rowname), factor_key = T)
## # A tibble: 50 x 5
## rowname model_id model_type key value
## <chr> <fct> <fct> <fct> <dbl>
## 1 1 GLM_grid_0_AutoML_20181028_212737_mo~ GLM auc 0.846
## 2 2 StackedEnsemble_BestOfFamily_0_AutoM~ StackedEnsem~ auc 0.840
## 3 3 StackedEnsemble_AllModels_0_AutoML_2~ StackedEnsem~ auc 0.839
## 4 4 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.816
## 5 5 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.800
## 6 6 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.797
## 7 7 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.795
## 8 8 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.783
## 9 9 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.783
## 10 10 GBM_grid_0_AutoML_20181028_212737_mo~ GBM auc 0.781
## # ... with 40 more rows
fct_rev() reverses the factor levels.
In facet_wrap, scales can be fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y"). The course example used a deep learning model where only auc and logloss were available metrics. Other models may return those plus mean_per_class_error, rmse, and mse.
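What fct_rev() does can be shown with a base R equivalent: it simply reverses the order of the factor levels without changing the data.

```r
# Base R equivalent of forcats::fct_rev(): reverse the level order
f     <- factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
f_rev <- factor(f, levels = rev(levels(f)))

levels(f)      # "low"  "mid"  "high"
levels(f_rev)  # "high" "mid"  "low"
```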
# Visualize the H2O leaderboard to help with model selection
plot_h2o_leaderboard <- function(h2o_leaderboard, order_by = c("auc", "logloss"),
n_max = 20, size = 4, include_lbl = TRUE) {
# Setup inputs
order_by <- tolower(order_by[[1]])
leaderboard_tbl <- h2o_leaderboard %>% as.tibble() %>%
mutate(model_type = str_split(model_id, "_", simplify = T) %>% .[,1]) %>%
rownames_to_column(var = "rowname") %>%
mutate(model_id = paste0(rowname, ". ", as.character(model_id)) %>% as.factor())
# Transformation
if (order_by == "auc") {
data_transformed_tbl <- leaderboard_tbl %>% slice(1:n_max) %>%
mutate(model_id = as_factor(model_id) %>% reorder(auc),
model_type = as.factor(model_type)) %>%
gather(key = key, value = value,
-c(model_id, model_type, rowname), factor_key = T)
} else if (order_by == "logloss") {
data_transformed_tbl <- leaderboard_tbl %>% slice(1:n_max) %>%
mutate(model_id = as_factor(model_id) %>% reorder(logloss) %>% fct_rev(),
model_type = as.factor(model_type)) %>%
gather(key = key, value = value, -c(model_id, model_type, rowname), factor_key = T)
} else {stop(paste0("order_by = '", order_by, "' is not a permitted option."))}
# Visualization
g <- data_transformed_tbl %>%
ggplot(aes(value, model_id, color = model_type)) + geom_point(size = size) +
facet_wrap(~ key, scales = "free_x") + theme_tq() + scale_color_tq() +
labs(title = "Leaderboard Metrics", subtitle = paste0("Ordered by: ", toupper(order_by)),
y = "Model Position, Model ID", x = "")
if (include_lbl) g <- g + geom_label(aes(label = round(value, 2), hjust = "inward"))
return(g)}
# Test Model Visualization
automl_models_h2o@leaderboard %>% plot_h2o_leaderboard(n_max = 10, order_by = "auc")
The number of facets changes with the model type. Deep Learning models might provide only AUC and Logloss; other models may also return rmse, mse, etc.
This chapter covers AUC, Logloss, Precision-Recall plots, ROC plots, and Gain and Lift charts.
A consolidated plot will also be created - one that is suitable for executive presentation.
Create an H2O performance object. You could use h2o.loadModel(), but it is simpler to reuse the current leader. Let's also get a deep learning model.
myModel_GLM <- automl_models_h2o@leader
myModel_Deep <- extract_h2o_model_name_by_position(automl_models_h2o@leaderboard, position = 4, verbose = F) %>% h2o.getModel()
h2o.performance() makes an H2O performance object given a model and new data.
## H2OBinomialMetrics: glm
##
## MSE: 0.089
## RMSE: 0.3
## LogLoss: 0.3
## Mean Per-Class Error: 0.22
## AUC: 0.87
## Gini: 0.74
## R^2: 0.35
## Residual Deviance: 131
## AIC: 291
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 171 13 0.070652 =13/184
## Yes 13 23 0.361111 =13/36
## Totals 184 36 0.118182 =26/220
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.361471 0.638889 35
## 2 max f2 0.305865 0.680628 46
## 3 max f0point5 0.556700 0.714286 18
## 4 max accuracy 0.556700 0.895455 18
## 5 max precision 0.907863 1.000000 0
## 6 max recall 0.040203 1.000000 145
## 7 max specificity 0.907863 1.000000 0
## 8 max absolute_mcc 0.361471 0.568237 35
## 9 max min_per_class_accuracy 0.196433 0.777174 68
## 10 max mean_per_class_accuracy 0.305865 0.804046 46
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
The performance object is verbose!
## $model
## $model$`__meta`
## $model$`__meta`$schema_version
## [1] 3
##
## $model$`__meta`$schema_name
## [1] "ModelKeyV3"
##
## $model$`__meta`$schema_type
## [1] "Key<Model>"
##
##
## $model$name
## [1] "GLM_grid_0_AutoML_20181028_212737_model_0"
##
## $model$type
## [1] "Key<Model>"
##
## $model$URL
## [1] "/3/Models/GLM_grid_0_AutoML_20181028_212737_model_0"
##
##
## $model_checksum
## [1] 1662347448424290560
##
## $frame
## $frame$name
## [1] "test_tbl"
##
##
## $frame_checksum
## [1] -54403433091472432
##
## $description
## NULL
##
## $scoring_time
## [1] 1540776501447
##
## $predictions
## NULL
##
## $MSE
## [1] 0.089
##
## $RMSE
## [1] 0.3
##
## $nobs
## [1] 220
##
## $custom_metric_name
## NULL
##
## $custom_metric_value
## [1] 0
##
## $r2
## [1] 0.35
##
## $logloss
## [1] 0.3
##
## $AUC
## [1] 0.87
##
## $pr_auc
## [1] 0.66
##
## $Gini
## [1] 0.74
##
## $mean_per_class_error
## [1] 0.22
##
## $domain
## [1] "No" "Yes"
##
## $cm
## $cm$`__meta`
## $cm$`__meta`$schema_version
## [1] 3
##
## $cm$`__meta`$schema_name
## [1] "ConfusionMatrixV3"
##
## $cm$`__meta`$schema_type
## [1] "ConfusionMatrix"
##
##
## $cm$table
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## No Yes Error Rate
## No 171 13 0.0707 = 13 / 184
## Yes 13 23 0.3611 = 13 / 36
## Totals 184 36 0.1182 = 26 / 220
##
##
## $thresholds_and_metric_scores
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
## threshold f1 f2 f0point5 accuracy precision recall
## 1 0.907863 0.054054 0.034483 0.125000 0.840909 1.000000 0.027778
## 2 0.815811 0.105263 0.068493 0.227273 0.845455 1.000000 0.055556
## 3 0.801944 0.153846 0.102041 0.312500 0.850000 1.000000 0.083333
## 4 0.769743 0.200000 0.135135 0.384615 0.854545 1.000000 0.111111
## 5 0.720564 0.243902 0.167785 0.446429 0.859091 1.000000 0.138889
## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1 1.000000 0.152769 0.027778 0.513889
## 2 1.000000 0.216543 0.055556 0.527778
## 3 1.000000 0.265820 0.083333 0.541667
## 4 1.000000 0.307653 0.111111 0.555556
## 5 1.000000 0.344765 0.138889 0.569444
## tns fns fps tps tnr fnr fpr tpr idx
## 1 184 35 0 1 1.000000 0.972222 0.000000 0.027778 0
## 2 184 34 0 2 1.000000 0.944444 0.000000 0.055556 1
## 3 184 33 0 3 1.000000 0.916667 0.000000 0.083333 2
## 4 184 32 0 4 1.000000 0.888889 0.000000 0.111111 3
## 5 184 31 0 5 1.000000 0.861111 0.000000 0.138889 4
##
## ---
## threshold f1 f2 f0point5 accuracy precision recall
## 215 0.001598 0.286853 0.501393 0.200893 0.186364 0.167442 1.000000
## 216 0.001287 0.285714 0.500000 0.200000 0.181818 0.166667 1.000000
## 217 0.001130 0.284585 0.498615 0.199115 0.177273 0.165899 1.000000
## 218 0.000965 0.283465 0.497238 0.198238 0.172727 0.165138 1.000000
## 219 0.000884 0.282353 0.495868 0.197368 0.168182 0.164384 1.000000
## 220 0.000824 0.281250 0.494505 0.196507 0.163636 0.163636 1.000000
## specificity absolute_mcc min_per_class_accuracy
## 215 0.027174 0.067454 0.027174
## 216 0.021739 0.060193 0.021739
## 217 0.016304 0.052008 0.016304
## 218 0.010870 0.042367 0.010870
## 219 0.005435 0.029890 0.005435
## 220 0.000000 0.000000 0.000000
## mean_per_class_accuracy tns fns fps tps tnr fnr fpr
## 215 0.513587 5 0 179 36 0.027174 0.000000 0.972826
## 216 0.510870 4 0 180 36 0.021739 0.000000 0.978261
## 217 0.508152 3 0 181 36 0.016304 0.000000 0.983696
## 218 0.505435 2 0 182 36 0.010870 0.000000 0.989130
## 219 0.502717 1 0 183 36 0.005435 0.000000 0.994565
## 220 0.500000 0 0 184 36 0.000000 0.000000 1.000000
## tpr idx
## 215 1.000000 214
## 216 1.000000 215
## 217 1.000000 216
## 218 1.000000 217
## 219 1.000000 218
## 220 1.000000 219
##
## $max_criteria_and_metric_scores
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.361471 0.638889 35
## 2 max f2 0.305865 0.680628 46
## 3 max f0point5 0.556700 0.714286 18
## 4 max accuracy 0.556700 0.895455 18
## 5 max precision 0.907863 1.000000 0
## 6 max recall 0.040203 1.000000 145
## 7 max specificity 0.907863 1.000000 0
## 8 max absolute_mcc 0.361471 0.568237 35
## 9 max min_per_class_accuracy 0.196433 0.777174 68
## 10 max mean_per_class_accuracy 0.305865 0.804046 46
##
## $gains_lift_table
## Gains/Lift Table: Avg response rate: 16.36 %, avg score: 17.06 %
## group cumulative_data_fraction lower_threshold lift cumulative_lift
## 1 1 0.01363636 0.795825 6.111111 6.111111
## 2 2 0.02272727 0.698406 6.111111 6.111111
## 3 3 0.03181818 0.657941 6.111111 6.111111
## 4 4 0.04090909 0.631589 6.111111 6.111111
## 5 5 0.05000000 0.610115 3.055556 5.555556
## 6 6 0.10000000 0.493094 3.888889 4.722222
## 7 7 0.15000000 0.373599 2.222222 3.888889
## 8 8 0.20000000 0.314707 2.222222 3.472222
## 9 9 0.30000000 0.200597 0.555556 2.500000
## 10 10 0.40000000 0.128708 0.833333 2.083333
## 11 11 0.50000000 0.087737 0.555556 1.777778
## 12 12 0.60000000 0.055181 0.277778 1.527778
## 13 13 0.70000000 0.031811 0.833333 1.428571
## 14 14 0.80000000 0.017333 0.000000 1.250000
## 15 15 0.90000000 0.008458 0.000000 1.111111
## 16 16 1.00000000 0.000824 0.000000 1.000000
## response_rate score cumulative_response_rate cumulative_score
## 1 1.000000 0.841873 1.000000 0.841873
## 2 1.000000 0.745154 1.000000 0.803185
## 3 1.000000 0.662249 1.000000 0.762918
## 4 1.000000 0.644891 1.000000 0.736689
## 5 0.500000 0.625222 0.909091 0.716423
## 6 0.636364 0.569948 0.772727 0.643185
## 7 0.363636 0.425287 0.636364 0.570552
## 8 0.363636 0.343600 0.568182 0.513814
## 9 0.090909 0.256735 0.409091 0.428121
## 10 0.136364 0.163885 0.340909 0.362062
## 11 0.090909 0.103809 0.290909 0.310412
## 12 0.045455 0.071676 0.250000 0.270622
## 13 0.136364 0.041599 0.233766 0.237905
## 14 0.000000 0.024287 0.204545 0.211202
## 15 0.000000 0.012672 0.181818 0.189144
## 16 0.000000 0.004134 0.163636 0.170643
## capture_rate cumulative_capture_rate gain cumulative_gain
## 1 0.083333 0.083333 511.111111 511.111111
## 2 0.055556 0.138889 511.111111 511.111111
## 3 0.055556 0.194444 511.111111 511.111111
## 4 0.055556 0.250000 511.111111 511.111111
## 5 0.027778 0.277778 205.555556 455.555556
## 6 0.194444 0.472222 288.888889 372.222222
## 7 0.111111 0.583333 122.222222 288.888889
## 8 0.111111 0.694444 122.222222 247.222222
## 9 0.055556 0.750000 -44.444444 150.000000
## 10 0.083333 0.833333 -16.666667 108.333333
## 11 0.055556 0.888889 -44.444444 77.777778
## 12 0.027778 0.916667 -72.222222 52.777778
## 13 0.083333 1.000000 -16.666667 42.857143
## 14 0.000000 1.000000 -100.000000 25.000000
## 15 0.000000 1.000000 -100.000000 11.111111
## 16 0.000000 1.000000 -100.000000 0.000000
##
## $residual_deviance
## [1] 131
##
## $null_deviance
## [1] 196
##
## $AIC
## [1] 291
##
## $null_degrees_of_freedom
## [1] 219
##
## $residual_degrees_of_freedom
## [1] 140
Accessing the information from the performance object is simple!
Some metrics are directly available, like auc: Area Under the Curve, referring to the Receiver Operating Characteristic (ROC) plot that measures the true positive rate (TPR) and the false positive rate (FPR).
## [1] 0.87
h2o.auc() provides options for returning values for train, validation, and cross-validation. However, these options are valid only for models, not performance objects.
This works:
## train valid xval
## 0.90 0.81 0.85
This does not return multiple values:
## [1] 0.87
\(AUC = (Gini + 1)/2\) | \(Gini = 2 \times AUC - 1\)
The Gini Coefficient is a popular metric on Kaggle, especially for imbalanced class values.
The Gini coefficient is one of many ways to measure divergence, i.e. the difference in the distribution of two variables in a data set. Values range from -1 to 1, indicating total divergence in one direction or the other, with 0 representing no divergence.
## [1] 0.74
## train valid xval
## 0.80 0.62 0.69
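The reported values are consistent with the identity above; a quick sanity check in base R using the test-set metrics from the output:

```r
# Gini = 2 * AUC - 1, checked against the reported test-set metrics
auc  <- 0.87               # h2o.auc() on the performance object
gini <- 2 * auc - 1
round(gini, 2)             # 0.74, matching h2o.giniCoef() above
```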
Logloss is a metric that measures the class probability from the model against the actual value in binary format (0/1). See this for more information.
Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Minimizing the Log Loss is basically equivalent to maximizing the accuracy of the classifier. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss.
Log Loss heavily penalizes classifiers that are confident about an incorrect classification. For example, if for a particular observation the classifier assigns a very small probability to the correct class, then the corresponding contribution to the Log Loss will be very large indeed. Naturally this is going to have a significant impact on the overall Log Loss for the classifier. The bottom line is that it’s better to be somewhat wrong than emphatically wrong. Of course it’s always better to be completely right, but that is seldom achievable in practice!
## [1] 0.3
## train valid xval
## 0.27 0.30 0.32
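The penalty for confident-but-wrong predictions is visible in the per-observation formula; a small sketch with hypothetical probabilities (not from the model above):

```r
# Per-observation log loss: -(y * log(p) + (1 - y) * log(1 - p))
logloss <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

logloss(1, 0.9)   # confident and right: small penalty (~0.11)
logloss(1, 0.6)   # hesitant and right: moderate penalty (~0.51)
logloss(1, 0.01)  # confident and wrong: huge penalty (~4.61)
```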
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.361470825344551:
## No Yes Error Rate
## No 171 13 0.070652 =13/184
## Yes 13 23 0.361111 =13/36
## Totals 184 36 0.118182 =26/220
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.348527593514236:
## No Yes Error Rate
## No 838 51 0.057368 =51/889
## Yes 58 118 0.329545 =58/176
## Totals 896 169 0.102347 =109/1065
You must understand threshold, precision, and recall:
Review the output from the confusion matrix.
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.348527593514236:
## No Yes Error Rate
## No 838 51 0.057368 =51/889
## Yes 58 118 0.329545 =58/176
## Totals 896 169 0.102347 =109/1065
How does the max f1 threshold affect the models? Use h2o.metric(). Make the metric output easier to read by turning it into a tibble.
## # A tibble: 220 x 20
## threshold f1 f2 f0point5 accuracy precision recall specificity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.908 0.0541 0.0345 0.125 0.841 1 0.0278 1
## 2 0.816 0.105 0.0685 0.227 0.845 1 0.0556 1
## 3 0.802 0.154 0.102 0.312 0.85 1 0.0833 1
## 4 0.770 0.20 0.135 0.385 0.855 1 0.111 1
## 5 0.721 0.244 0.168 0.446 0.859 1 0.139 1
## 6 0.662 0.286 0.200 0.5 0.864 1 0.167 1
## 7 0.662 0.326 0.232 0.547 0.868 1 0.194 1
## 8 0.655 0.364 0.263 0.588 0.873 1 0.222 1
## 9 0.635 0.4 0.294 0.625 0.877 1 0.25 1
## 10 0.630 0.435 0.325 0.658 0.882 1 0.278 1
## # ... with 210 more rows, and 12 more variables: absolute_mcc <dbl>,
## # min_per_class_accuracy <dbl>, mean_per_class_accuracy <dbl>,
## # tns <dbl>, fns <dbl>, fps <dbl>, tps <dbl>, tnr <dbl>, fnr <dbl>,
## # fpr <dbl>, tpr <dbl>, idx <int>
Precision
\[Precision = TP/(TP + FP)\]
In other words, Precision detects how frequently we over-pick the Yes class. In the confusion matrix above, 51 records were over-picked - these employees were predicted to leave but did not. (No real harm in this - the company did not lose money.)
## [1] 0.7
Of the employees the algorithm predicted would leave, this is the fraction that actually did.
Recall
\[Recall = TP/(FN + TP)\] Recall provides a metric for under-picking the Yes class.
## [1] 0.67
This measures how often we catch people that leave; 1 - recall is how often we miss them by predicting they will stay.
Recall is typically more important than Precision because you usually would rather accept some false positives (inadvertently target stayers) than false negatives (miss actual quitters).
F1
F1 balances precision and recall.
\[F1 = (2 * (Precision * Recall))/(Precision + Recall)\]
## [1] 0.68
The reported threshold is the cutoff that maximizes F1 across all candidate thresholds.
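These three metrics can be reproduced directly from the confusion matrix counts shown earlier (TP = 118, FP = 51, FN = 58):

```r
# Reproduce precision, recall, and F1 from the confusion matrix counts
tp <- 118; fp <- 51; fn <- 58

precision <- tp / (tp + fp)   # 118 / 169
recall    <- tp / (tp + fn)   # 118 / 176
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 2)
# 0.70, 0.67, 0.68 - matching the outputs above
```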
Threshold {#precisionrecallplot}
Note the use of a new H~2~O function called h2o.find_threshold_by_max_metric.
performance_tbl %>% ggplot(aes(x = threshold)) +
geom_line(aes(y = precision), color = "blue", size = 1) +
geom_line(aes(y = recall), color = "red", size = 1) +
geom_vline(xintercept = h2o.find_threshold_by_max_metric(perf_h2o_GLM, "f1")) +
theme_tq() + labs(title = "Precision vs Recall", y = "value")
To compare models, we need to load a few more models. A Deep Learning model has already been saved. Let's also save one of the Stacked models.
## model_id auc logloss
## 1 GLM_grid_0_AutoML_20181028_212737_model_0 0.85 0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84 0.32
## 3 StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84 0.32
## 4 GBM_grid_0_AutoML_20181028_212737_model_4 0.82 0.34
## 5 GBM_grid_0_AutoML_20181028_212737_model_5 0.80 0.36
## 6 GBM_grid_0_AutoML_20181028_212737_model_1 0.80 0.36
## mean_per_class_error rmse mse
## 1 0.24 0.31 0.094
## 2 0.23 0.31 0.093
## 3 0.24 0.30 0.093
## 4 0.25 0.32 0.103
## 5 0.28 0.33 0.109
## 6 0.27 0.33 0.107
##
## [21 rows x 6 columns]
Next collect the performance metrics for the Deep Learning and Stacked models. Develop new functions: load_model_performance_metrics and Model_Metrics.
load_model_performance_metrics <- function(path, test_tbl){
model_h2o <- h2o.loadModel(path)
perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
perf_h2o %>% h2o.metric() %>% as.tibble() %>% mutate(auc = h2o.auc(perf_h2o)) %>%
select(tpr, fpr, auc)}
model_metrics_tbl <- fs::dir_info(path = "04_Modeling/h2o_models/") %>%
select(path) %>%
mutate(metrics = map(path, load_model_performance_metrics, test_tbl)) %>%
unnest() # it was a list, now it's flat
The ROC Plot compares the TPR (y-axis) against the FPR (x-axis). TPR is the rate at which people who leave are correctly identified as leaving. FPR is the rate at which people who stay are incorrectly identified as leaving.
# Data prep for plotting; # recall numeric to factor must first convert to character
model_metrics_tbl %>% mutate(
path = str_split(path, pattern = "/", simplify = T)[, 3] %>% as_factor(),
auc = auc %>% round(3) %>% as.character() %>% as_factor()) %>%
ggplot(aes(x = fpr, y = tpr, color = path, linetype = auc)) + geom_line(size = 1) + theme_tq() + scale_color_tq()+
theme(legend.direction = "vertical") + labs(title ="ROC Plot", subtitle = "Performance of Top Performing Models") +
annotate(geom = "text", x = 0.6, y = 0.25, label = "TPR - Rate people correctly identified as leaving") +
annotate(geom = "text", x = 0.6, y = 0.20, label = "FPR - Rate people that stay incorrectly identified as leaving")
Turn the precision/recall trade-off into a performance comparison. This is another multiple-model evaluation.
load_model_performance_metrics <- function(path, test_tbl){
model_h2o <- h2o.loadModel(path)
perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
perf_h2o %>% h2o.metric() %>% as.tibble() %>% mutate(auc = h2o.auc(perf_h2o)) %>%
select(tpr, fpr, auc, precision, recall)}
# Data prep for plotting; recall that numeric to factor must first convert to character
model_metrics_tbl %>% mutate(
path = str_split(path, pattern = "/", simplify = T)[, 3] %>% as_factor(),
auc = auc %>% round(3) %>% as.character() %>% as_factor()) %>%
ggplot(aes(x = recall, y = precision, color = path, linetype = auc)) + geom_line(size = 1) + theme_tq() + scale_color_tq()+
theme(legend.direction = "vertical") + labs(title ="Precision vs Recall Plot",
subtitle = "Performance of Top Performing Models") +
annotate(geom = "text", x = 0.6, y = 0.25, label = "Precision - Predicting employees to leave but they actually stay") +
annotate(geom = "text", x = 0.6, y = 0.20, label = "Recall - Predicting employees to stay but they actually leave")
> The goal is to reduce under-prediction, i.e., employees predicted to stay who actually leave (Recall)
Gain and Lift charts are used to evaluate the performance of a classification model. They measure how much better you can expect to do with the predictive model compared to doing without one. They are very popular metrics in marketing analytics, but they are not restricted to marketing; they can be used in other domains as well, such as risk modeling and supply chain analytics. They also help find the best predictive model among multiple challenger models.
See https://www.listendata.com/2014/08/excel-template-gain-and-lift-charts.html - seems to follow the same process as the one below. The goal is to make a Gain-Lift Table:
Sort the scored file in descending order by estimated probability. Ranking by the class probability of Yes assesses the model's ability to detect someone that is leaving. Note bind_cols, which is part of dplyr.
ranked_predictions_tbl <- predictions_tbl %>% bind_cols(test_tbl) %>% select(predict:Yes, Attrition) %>%
arrange(desc(Yes))
ranked_predictions_tbl
## # A tibble: 220 x 4
## predict No Yes Attrition
## <fct> <dbl> <dbl> <fct>
## 1 Yes 0.0921 0.908 Yes
## 2 Yes 0.184 0.816 Yes
## 3 Yes 0.198 0.802 Yes
## 4 Yes 0.230 0.770 Yes
## 5 Yes 0.279 0.721 Yes
## 6 Yes 0.338 0.662 Yes
## 7 Yes 0.338 0.662 Yes
## 8 Yes 0.345 0.655 Yes
## 9 Yes 0.365 0.635 Yes
## 10 Yes 0.370 0.630 Yes
## # ... with 210 more rows
Steps:
Split the ranked file into 10 sections (deciles). Grouping into cohorts from most likely to least likely is at the heart of gain/lift. When we do this, we can immediately show how likely a candidate with a high predicted probability of leaving actually is to leave.
Add a group index: group = row_number(). Count the number of actual events in each decile: responses = sum(Attrition == "Yes").
ranked_predictions_tbl %>% mutate(ntile = ntile(Yes, n = 10)) %>% group_by(ntile) %>%
summarise(cases = n(), responses = sum(Attrition == "Yes")) %>% arrange(desc(ntile))
## # A tibble: 10 x 3
## ntile cases responses
## <int> <int> <int>
## 1 10 22 17
## 2 9 22 8
## 3 8 22 2
## 4 7 22 3
## 5 6 22 2
## 6 5 22 1
## 7 4 22 3
## 8 3 22 0
## 9 2 22 0
## 10 1 22 0
In ntile = 10, 17 people actually left out of 22 employees! 8 of 22 actually left in the next ntile.
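These decile counts are all that is needed to compute gain and lift by hand; a quick sketch using the response counts above:

```r
# Gain and lift by hand from the decile response counts above
responses <- c(17, 8, 2, 3, 2, 1, 3, 0, 0, 0)  # actual leavers per decile
cases     <- rep(22, 10)                        # 220 employees in 10 deciles

gain      <- cumsum(responses) / sum(responses) # cumulative % of leavers captured
pct_cases <- cumsum(cases) / sum(cases)         # 0.1, 0.2, ..., 1.0
lift      <- gain / pct_cases                   # multiplier vs. no model

round(gain[2], 2)  # top 2 deciles capture 0.69 (~70%) of leavers
round(lift[1], 2)  # 4.72: the top decile is ~4.7x better than random
```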
ntile is also from dplyr. It simply bins the records into 10 groups; note that the 10th group has the highest values of Yes. It turns the continuous variable Yes into a discrete variable via the new ntile groups. dplyr::row_number is useful in calculations to simply turn the row numbers of a tibble into a column.
ranked_predictions_tbl %>% mutate(ntile = ntile(Yes, n = 10)) %>% group_by(ntile) %>%
summarise(cases = n(), responses = sum(Attrition == "Yes")) %>% arrange(desc(ntile)) %>% mutate(group = row_number())
## # A tibble: 10 x 4
## ntile cases responses group
## <int> <int> <int> <int>
## 1 10 22 17 1
## 2 9 22 8 2
## 3 8 22 2 3
## 4 7 22 3 4
## 5 6 22 2 5
## 6 5 22 1 6
## 7 4 22 3 7
## 8 3 22 0 8
## 9 2 22 0 9
## 10 1 22 0 10
Now group - the row numbers - helps: the first group has the highest number of employees that actually left. We no longer need the ntile column.
Time to calculate the gain and lift.
Cumulative responses: cumulative_responses = cumsum(responses). Percent of responses per group: pct_responses = responses / sum(responses). Gain is the cumulative percent of responses: gain = cumsum(pct_responses).
For lift, first compute cumulative_pct_cases = cumsum(cases)/sum(cases). This is a simple calculation: with 10 groups it starts at 0.100 and ends at 1.00 in steps of 0.100 (.1, .2, .3, ...). Think of lift as a multiplier between what we gained and what we expected to gain with no model. For example, focusing on the first 2 decile groups, we gained the ability to target ~70% of our quitters, but we expected to be able to target only 20% (2 deciles of cumulative_pct_cases = 20%).
lift is just another way to show the gain. lift = 4.7 simply means you are 4.7X more likely to identify quitters using the model.
calculated_gain_lift_tbl <- ranked_predictions_tbl %>% mutate(ntile = ntile(Yes, n = 10)) %>% group_by(ntile) %>%
summarise(cases = n(), responses = sum(Attrition == "Yes")) %>% arrange(desc(ntile)) %>%
mutate(group = row_number()) %>% select(group, cases, responses) %>%
mutate(cumulative_responses = cumsum(responses), pct_responses = responses / sum(responses),
gain = cumsum(pct_responses),
# we gain ability to target 75% of quitter focused on 1st 3 groups
cumulative_pct_cases = cumsum(cases)/sum(cases), lift = gain/cumulative_pct_cases,
gain_baseline = cumulative_pct_cases, lift_baseline = gain_baseline / cumulative_pct_cases)
calculated_gain_lift_tbl %>% glimpse()## Observations: 10
## Variables: 10
## $ group <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ cases <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22
## $ responses <int> 17, 8, 2, 3, 2, 1, 3, 0, 0, 0
## $ cumulative_responses <int> 17, 25, 27, 30, 32, 33, 36, 36, 36, 36
## $ pct_responses <dbl> 0.472, 0.222, 0.056, 0.083, 0.056, 0.028,...
## $ gain <dbl> 0.47, 0.69, 0.75, 0.83, 0.89, 0.92, 1.00,...
## $ cumulative_pct_cases <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0...
## $ lift <dbl> 4.7, 3.5, 2.5, 2.1, 1.8, 1.5, 1.4, 1.2, 1...
## $ gain_baseline <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0...
## $ lift_baseline <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Above, it is unclear why gain_baseline and lift_baseline were added. (Note that lift_baseline is always 1.)
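The gain and lift arithmetic can be verified by hand with base R, using the response counts from the decile table above:

```r
# Responses per decile, from the table above (group 1 = highest Yes probabilities)
responses <- c(17, 8, 2, 3, 2, 1, 3, 0, 0, 0)
cases     <- rep(22, 10)

gain <- cumsum(responses / sum(responses))          # cumulative pct of leavers captured
cumulative_pct_cases <- cumsum(cases) / sum(cases)  # 0.1, 0.2, ..., 1.0
lift <- gain / cumulative_pct_cases

round(gain[2], 2)  # first two deciles capture ~69% of leavers
round(lift[1], 1)  # matches the first-group lift of ~4.7 above
```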
h2o.gainsLift streamlines the manual gain-lift code above. h2o.performance() is the starting point for the H2O gain and lift function.
## Observations: 16
## Variables: 13
## $ group <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ cumulative_data_fraction <dbl> 0.014, 0.023, 0.032, 0.041, 0.050, 0....
## $ lower_threshold <dbl> 0.79583, 0.69841, 0.65794, 0.63159, 0...
## $ lift <dbl> 6.11, 6.11, 6.11, 6.11, 3.06, 3.89, 2...
## $ cumulative_lift <dbl> 6.1, 6.1, 6.1, 6.1, 5.6, 4.7, 3.9, 3....
## $ response_rate <dbl> 1.000, 1.000, 1.000, 1.000, 0.500, 0....
## $ score <dbl> 0.8419, 0.7452, 0.6622, 0.6449, 0.625...
## $ cumulative_response_rate <dbl> 1.00, 1.00, 1.00, 1.00, 0.91, 0.77, 0...
## $ cumulative_score <dbl> 0.84, 0.80, 0.76, 0.74, 0.72, 0.64, 0...
## $ capture_rate <dbl> 0.083, 0.056, 0.056, 0.056, 0.028, 0....
## $ cumulative_capture_rate <dbl> 0.083, 0.139, 0.194, 0.250, 0.278, 0....
## $ gain <dbl> 511, 511, 511, 511, 206, 289, 122, 12...
## $ cumulative_gain <dbl> 511, 511, 511, 511, 456, 372, 289, 24...
Only a subset of the H2O data is needed:
gain_lift_tbl <- perf_h2o_GLM %>% h2o.gainsLift() %>% select(group, cumulative_data_fraction, cumulative_capture_rate, cumulative_lift) %>% as.tibble()
gain_lift_tbl## # A tibble: 16 x 4
## group cumulative_data_fraction cumulative_capture_rate cumulative_lift
## <int> <dbl> <dbl> <dbl>
## 1 1 0.0136 0.0833 6.11
## 2 2 0.0227 0.139 6.11
## 3 3 0.0318 0.194 6.11
## 4 4 0.0409 0.25 6.11
## 5 5 0.05 0.278 5.56
## 6 6 0.1 0.472 4.72
## 7 7 0.15 0.583 3.89
## 8 8 0.2 0.694 3.47
## 9 9 0.3 0.75 2.5
## 10 10 0.4 0.833 2.08
## 11 11 0.5 0.889 1.78
## 12 12 0.6 0.917 1.53
## 13 13 0.7 1 1.43
## 14 14 0.8 1 1.25
## 15 15 0.9 1 1.11
## 16 16 1 1 1
This is all that is needed to create our gain and lift plots.
Use gather() to stack the data so we can use the ggplot color and group aesthetics or facet_wrap. cumulative_lift is not needed here (but it is used later for lift).
gain_transformed_tbl <- gain_lift_tbl %>%
select(group, cumulative_data_fraction, cumulative_capture_rate) %>% mutate(baseline = cumulative_data_fraction) %>%
rename(gain = cumulative_capture_rate) %>% gather(key = key, value = value, gain, baseline)
gain_transformed_tbl## # A tibble: 32 x 4
## group cumulative_data_fraction key value
## <int> <dbl> <chr> <dbl>
## 1 1 0.0136 gain 0.0833
## 2 2 0.0227 gain 0.139
## 3 3 0.0318 gain 0.194
## 4 4 0.0409 gain 0.25
## 5 5 0.05 gain 0.278
## 6 6 0.1 gain 0.472
## 7 7 0.15 gain 0.583
## 8 8 0.2 gain 0.694
## 9 9 0.3 gain 0.75
## 10 10 0.4 gain 0.833
## # ... with 22 more rows
gain_transformed_tbl %>% ggplot(aes(x = cumulative_data_fraction, y = value, color = key)) +
geom_line(size = 1.5) + theme_tq() + scale_color_tq() +
labs(title = "Gain Chart", x = "Cumulative Data Fraction", y = "Gain")
The lift table uses cumulative_lift in place of cumulative_capture_rate:
lift_transformed_tbl <- gain_lift_tbl %>%
select(group, cumulative_data_fraction, cumulative_lift) %>% mutate(baseline = 1) %>%
rename(lift = cumulative_lift) %>% gather(key = key, value = value, lift, baseline)
lift_transformed_tbl## # A tibble: 32 x 4
## group cumulative_data_fraction key value
## <int> <dbl> <chr> <dbl>
## 1 1 0.0136 lift 6.11
## 2 2 0.0227 lift 6.11
## 3 3 0.0318 lift 6.11
## 4 4 0.0409 lift 6.11
## 5 5 0.05 lift 5.56
## 6 6 0.1 lift 4.72
## 7 7 0.15 lift 3.89
## 8 8 0.2 lift 3.47
## 9 9 0.3 lift 2.5
## 10 10 0.4 lift 2.08
## # ... with 22 more rows
lift_transformed_tbl %>% ggplot(aes(x = cumulative_data_fraction, y = value, color = key)) +
geom_line(size = 1.5) + theme_tq() + scale_color_tq() +
labs(title = "Lift Chart", x = "Cumulative Data Fraction", y = "Lift")
NOTE: The plots used for explanation differ slightly from the calculated plots in the RMD.
Below, doing nothing - not using the model - and targeting 25% of employees at random would capture only 25% of leavers. With the model, 75% of leavers would be captured - 3X more!
Strategically targeting 25% of high probability customers should yield 75% of potential positive responses!
Lift is a multiplier: How many positive responses would you get at random X Lift = How many would you get with the model? 3X shown below.
Example: How does lift apply to attrition? Provide stock options strategically to high performers that are at high risk, or focus on those in the high-risk category that are working overtime. Stock options would then cost less overall compared to offering stock options randomly - reducing cost by 33%!
Always explain in terms of more customers, less churn, better quality, reduced lead times, or anything else executives care about.
plot_h2o_performance brings together the model and business plots into one function: ROC, Precision vs Recall, Gain and Lift plots.
order_by <- tolower(order_by[[1]]) simply selects the first element of the argument order_by = c("auc", "logloss"); if the user does not select one, auc is used as the default.
rlang::sym(order_by) converts a string stored in a variable to a column name (symbol) that is unevaluated, so we can use it later with !! (bang bang).
h2o.getModel gets a model from the current session; h2o.loadModel retrieves a model that was previously saved to disk.
fct_reorder(!! order_by_expr, .desc = ifelse(order_by == "auc", TRUE, FALSE)) is cool: if the selected metric is auc, the code orders highest to lowest; if the metric is logloss, it sorts lowest to highest. Very cool! fct_reorder requires a numeric value.
glue is used - a better method of pasting strings than paste0.
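The tidy-eval pattern described here can be sketched in isolation (toy tibble; the model names and auc values are made up):

```r
library(dplyr)
library(rlang)
library(forcats)

# Hypothetical leaderboard-like tibble
tbl <- tibble(model_id = c("glm", "gbm", "rf"), auc = c(0.78, 0.85, 0.81))

order_by <- "auc"
order_by_expr <- sym(order_by)   # string -> unevaluated symbol

# !! unquotes the symbol so fct_reorder sees the auc column;
# .desc = TRUE puts the best AUC first
tbl %>%
  mutate(model_id = as_factor(model_id) %>%
           fct_reorder(!!order_by_expr, .desc = TRUE)) %>%
  pull(model_id) %>% levels()
# levels ordered gbm, rf, glm (highest AUC first)
```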
plot_h2o_performance <- function(h2o_leaderboard, newdata, order_by = c("auc", "logloss"), max_models = 3, size = 1.5){
# Inputs
leaderboard_tbl <- h2o_leaderboard %>% as.tibble() %>% slice(1:max_models)
newdata_tbl <- newdata %>% as.tibble()
order_by <- tolower(order_by[[1]])
order_by_expr <- rlang::sym(order_by)
h2o.no_progress()
# 1. Model Metrics
get_model_performance_metrics <- function(model_id, test_tbl){
model_h2o <- h2o.getModel(model_id)
perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
perf_h2o %>% h2o.metric() %>% as.tibble() %>% select(threshold, tpr, fpr,
precision, recall)}
model_metrics_tbl <- leaderboard_tbl %>%
mutate(metrics = map(model_id, get_model_performance_metrics, newdata_tbl)) %>%
unnest() %>%
mutate(
model_id = as_factor(model_id) %>%
fct_reorder(!! order_by_expr, .desc = ifelse(order_by =="auc", TRUE, FALSE)),
auc = auc %>% round(3) %>% as.character() %>% as_factor() %>%
fct_reorder(as.numeric(model_id)),
logloss = logloss %>% round(4) %>% as.character() %>% as_factor() %>%
fct_reorder(as.numeric(model_id)))
# 1A. ROC Plot
p1 <- model_metrics_tbl %>% ggplot(aes_string(x = "fpr",
y = "tpr", color = "model_id", linetype = order_by)) + geom_line(size = size) +
theme_tq() + scale_color_tq() + labs(title = "ROC", x = "FPR", y = "TPR") +
theme(legend.direction = "vertical") +
annotate(geom = "text", x = 0.4, y = 0.25, size = 3, color = "DarkSlateGrey",
label = "TPR - Rate people correctly identified as leaving") +
annotate(geom = "text", x = 0.4, y = 0.20, size = 3, color = "DarkSlateGrey",
label = "FPR - Rate people that stay incorrectly identified as leaving")
# 1B. Precision vs Recall Plot
p2 <- model_metrics_tbl %>%
ggplot(aes_string(x = "recall", y = "precision", color = "model_id", linetype = order_by)) +
geom_line(size = size) +
theme_tq() + scale_color_tq() + labs(title = "Precision vs Recall",
x = "Recall", y = "Precision") +
theme(legend.position = "none") +
annotate(geom = "text", x = 0.4, y = 0.25, size = 3, color = "DarkSlateGrey",
label = "Precision - Predicting employees to leave but they actually stay") +
annotate(geom = "text", x = 0.4, y = 0.20, size = 3, color = "DarkSlateGrey",
label = "Recall - Predicting employee to stay but actually leaves")
# 2. Gain / Lift
get_gain_lift <- function(model_id, test_tbl){
model_h2o <- h2o.getModel(model_id)
perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
perf_h2o %>% h2o.gainsLift() %>% as.tibble() %>%
select(group, cumulative_data_fraction, cumulative_capture_rate, cumulative_lift)}
gain_lift_tbl <- leaderboard_tbl %>%
mutate(metrics = map(model_id, get_gain_lift, newdata_tbl)) %>% unnest() %>%
mutate(
model_id = as_factor(model_id) %>%
fct_reorder(!! order_by_expr, .desc = ifelse(order_by == "auc", TRUE, FALSE)),
auc = auc %>% round(3) %>% as.character() %>% as_factor() %>%
fct_reorder(as.numeric(model_id)),
logloss = logloss %>% round(4) %>% as.character() %>% as_factor() %>%
fct_reorder(as.numeric(model_id))) %>%
rename(gain = cumulative_capture_rate, lift = cumulative_lift)
# 2A. Gain Plot
p3 <- gain_lift_tbl %>% ggplot(aes_string(x = "cumulative_data_fraction", y = "gain",
color= "model_id", linetype = order_by)) + geom_line(size = size) +
geom_segment(x = 0, y = 0, xend = 1, yend = 1, color = "black", size = size) + theme_tq() + scale_color_tq() +
expand_limits(x = c(0, 1), y = c(0, 1)) + labs(title = "Gain", x = "Cumulative Data Fraction", y = "Gain") +
theme(legend.position = "none")
# 2B. Lift Plot
p4 <- gain_lift_tbl %>% ggplot(aes_string(x = "cumulative_data_fraction", y = "lift",
color= "model_id", linetype = order_by)) + geom_line(size = size) +
geom_segment(x = 0, y = 1, xend = 1, yend = 1, color = "black", size = size) + theme_tq() + scale_color_tq() +
expand_limits(x = c(0, 1), y = c(0, 1)) + labs(title = "Lift", x = "Cumulative Data Fraction", y = "Lift") +
theme(legend.position = "none")
# Cow Plot
p_legend <- cowplot::get_legend(p1)
p1 <- p1 + theme(legend.position = "none")
p <- cowplot::plot_grid(p1, p2, p3, p4, ncol=2)
p_title <- ggdraw() + draw_label("H2O Model Metrics", size = 18, fontface = "bold",
colour = palette_light()[[1]])#MUST use COLOUR
p_subtitle <- ggdraw() + draw_label(glue("Ordered by {toupper(order_by)}"), size = 10,
colour = palette_light()[[1]])
ret <- plot_grid(p_title, p_subtitle, p, p_legend, ncol = 1, rel_heights = c(0.05, 0.05, 1, 0.05 * max_models))
h2o.show_progress()
return(ret)}
Test the new function.
automl_models_h2o@leaderboard %>%
plot_h2o_performance(newdata = test_tbl, order_by = "auc", max_models = 3)
automl_models_h2o@leaderboard %>%
plot_h2o_performance(newdata = test_tbl, order_by = "logloss", max_models = 2)
Take the developed H2O models and show how to explain localized prediction results using a special technique called LIME (Local Interpretable Model-Agnostic Explanations).
lime() function for single and multiple employees
explain() function for single and multiple employees
plot_features() (one or a small number of observations) and plot_explanations() (for many observations)
This section answers the question of WHY something happens - at the record level.
This is a preparatory step before getting into the weeds with lime. This should be familiar and serves as a review.
## # A tibble: 6 x 3
## predict No Yes
## <fct> <dbl> <dbl>
## 1 No 0.964 0.0359
## 2 No 0.891 0.109
## 3 No 0.967 0.0330
## 4 No 0.953 0.0471
## 5 Yes 0.210 0.790
## 6 No 0.937 0.0631
Focus on record 5 since it is predicted to turn over with high probability. Get more information about this person so we know what actually happened, and also get the employee's ID so we know who we are talking about.
predictions_tbl <- automl_leader %>% h2o.predict(newdata = as.h2o(test_tbl)) %>% as.tibble() %>%
bind_cols(test_tbl %>% select(Attrition, EmployeeNumber))
predictions_tbl %>% head()## # A tibble: 6 x 5
## predict No Yes Attrition EmployeeNumber
## <fct> <dbl> <dbl> <fct> <dbl>
## 1 No 0.964 0.0359 No 228
## 2 No 0.891 0.109 No 1278
## 3 No 0.967 0.0330 No 1250
## 4 No 0.953 0.0471 No 2065
## 5 Yes 0.210 0.790 Yes 1767
## 6 No 0.937 0.0631 No 1308
We now know the employee actually did turn over.
Now find the details about employee 1767, who turned over as predicted.
## Observations: 1
## Variables: 32
## $ Age <dbl> 43
## $ Attrition <fct> Yes
## $ BusinessTravel <fct> Travel_Frequently
## $ DailyRate <dbl> 807
## $ Department <fct> Research & Development
## $ DistanceFromHome <dbl> 17
## $ Education <fct> Bachelor
## $ EducationField <fct> Technical Degree
## $ EmployeeNumber <dbl> 1767
## $ EnvironmentSatisfaction <fct> High
## $ Gender <fct> Male
## $ HourlyRate <dbl> 38
## $ JobInvolvement <fct> Medium
## $ JobLevel <fct> 1
## $ JobRole <fct> Research Scientist
## $ JobSatisfaction <fct> High
## $ MaritalStatus <fct> Married
## $ MonthlyIncome <dbl> 2437
## $ MonthlyRate <dbl> 15587
## $ NumCompaniesWorked <dbl> 9
## $ OverTime <fct> Yes
## $ PercentSalaryHike <dbl> 16
## $ PerformanceRating <fct> Excellent
## $ RelationshipSatisfaction <fct> Very High
## $ StockOptionLevel <fct> 1
## $ TotalWorkingYears <dbl> 6
## $ TrainingTimesLastYear <dbl> 4
## $ WorkLifeBalance <fct> Better
## $ YearsAtCompany <dbl> 1
## $ YearsInCurrentRole <dbl> 0
## $ YearsSinceLastPromotion <dbl> 0
## $ YearsWithCurrManager <dbl> 0
It appears some factors support attrition and some do not. If we want to know why he turned over, use lime.
LIME is used to determine which features contribute to the prediction (and by how much) for a single (local) observation
lime()
explain()
h2o, keras and caret have been integrated into lime(). For other packages, use the special functions model_type() and predict_model(). http://uc-r.github.io/lime
To build the explainer, use the ML model and the feature distributions (bins) of the training data. Attrition is removed for the H2O model. bin_continuous makes it easy to detect what causes a continuous feature to have a high feature weight in the lime explanation. quantile_bins tells how to distribute observations within the bins; if TRUE, cuts will be evenly distributed within each bin. n_bins: 4-5 bins is typically sufficient; if n_bins = 4, bin_cuts will have 5 cuts, and the bins are then the observations that fall between the cuts.
explainer <- train_tbl %>% select(-Attrition) %>%
lime(model = automl_leader, bin_continuous = TRUE, n_bins = 4, quantile_bins = TRUE)
#explainer
The output is voluminous and omitted to save pages of printing! It includes how continuous variables are cut - important information. The explainer provides a lot of data; be careful before printing.
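A base-R sketch of what quantile binning does (hypothetical values; lime's internals may differ in detail):

```r
# With n_bins = 4 there are 5 cut points; quantile_bins = TRUE places
# roughly equal numbers of observations in each bin
x <- c(1, 2, 3, 5, 8, 13, 21, 34)
cuts <- quantile(x, probs = seq(0, 1, length.out = 5))
cuts

# Count how many observations fall between consecutive cuts
table(cut(x, breaks = unique(cuts), include.lowest = TRUE))
```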
Next, explain. LIME's 6 steps for explanation:
n_labels: binary classification, therefore just one label
n_features: number of important features to return to explain the record prediction
n_permutations: the more tests/evaluations, the better the explanations (typically). Lime will create 5000 localized linear models for evaluation.
kernel_width: affects the LIME linear model (R^2) and therefore should be tuned to ensure you get the best explanations - IMPORTANT
Attrition is removed because the data argument must match the format the model requires to predict. Since H2O requires x to be without the target variable, it must be removed below.
Use lime::explain because explain is a common function name used in other packages.
explanation <- test_tbl %>% slice(5) %>% select(-Attrition) %>%
lime::explain(explainer = explainer, n_labels = 1, n_features = 8, n_permutations = 5000,
kernel_width = 0.5)
explanation## model_type case label label_prob model_r2 model_intercept
## 1 classification 1 Yes 0.79 0.305 0.0699
## 2 classification 1 Yes 0.79 0.305 0.0699
## 3 classification 1 Yes 0.79 0.305 0.0699
## 4 classification 1 Yes 0.79 0.305 0.0699
## 5 classification 1 Yes 0.79 0.305 0.0699
## 6 classification 1 Yes 0.79 0.305 0.0699
## 7 classification 1 Yes 0.79 0.305 0.0699
## 8 classification 1 Yes 0.79 0.305 0.0699
## model_prediction feature feature_value feature_weight
## 1 0.492 OverTime 2 0.1783
## 2 0.492 NumCompaniesWorked 9 0.0887
## 3 0.492 JobLevel 1 0.0896
## 4 0.492 MonthlyIncome 2437 0.0799
## 5 0.492 StockOptionLevel 2 -0.0769
## 6 0.492 DistanceFromHome 17 0.0640
## 7 0.492 BusinessTravel 3 0.0569
## 8 0.492 YearsAtCompany 1 -0.0583
## feature_desc
## 1 OverTime = Yes
## 2 4 < NumCompaniesWorked
## 3 JobLevel = 1
## 4 MonthlyIncome <= 2935
## 5 StockOptionLevel = 1
## 6 13.8 < DistanceFromHome
## 7 BusinessTravel = Travel_Frequently
## 8 YearsAtCompany <= 3
## data
## 1 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 2 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 3 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 4 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 5 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 6 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 7 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 8 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## prediction
## 1 0.21, 0.79
## 2 0.21, 0.79
## 3 0.21, 0.79
## 4 0.21, 0.79
## 5 0.21, 0.79
## 6 0.21, 0.79
## 7 0.21, 0.79
## 8 0.21, 0.79
## [1] 0.305
We want R^2 to be as high as possible - optimize by adjusting kernel_width.
In this case, kernel_width = 0.5 works relatively well. Trying values from 0.2 to 1.5, 0.2 performed best.
explanation <- test_tbl %>% slice(5) %>% select(-Attrition) %>%
lime::explain(explainer = explainer, n_labels = 1, n_features = 8,
n_permutations = 5000, kernel_width = 0.2)
explanation$model_r2[1]## [1] 0.315
explanation <- test_tbl %>% slice(5) %>% select(-Attrition) %>%
lime::explain(explainer = explainer, n_labels = 1, n_features = 8,
n_permutations = 5000, kernel_width = 1.0)
explanation$model_r2[1]## [1] 0.312
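The manual re-runs above could be replaced by a purrr scan over several kernel widths. This is a sketch, reusing explainer and test_tbl from above; the model_r2 values will vary run to run:

```r
library(purrr)
library(dplyr)

kernel_widths <- c(0.2, 0.5, 1.0, 1.5)

r2_by_width <- map_dbl(kernel_widths, function(kw) {
  test_tbl %>% slice(5) %>% select(-Attrition) %>%
    lime::explain(explainer = explainer, n_labels = 1, n_features = 8,
                  n_permutations = 5000, kernel_width = kw) %>%
    pluck("model_r2", 1)   # R^2 is repeated on every feature row; take the first
})

tibble(kernel_width = kernel_widths, model_r2 = r2_by_width) %>%
  arrange(desc(model_r2))
```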
Explore other tuning parameters
Note we selected 8 features in the explainer so there are 8 records below.
## # A tibble: 8 x 6
## feature feature_value feature_weight feature_desc data prediction
## <chr> <dbl> <dbl> <chr> <lis> <list>
## 1 OverTime 2 0.184 OverTime = Yes <lis~ <list [2]>
## 2 JobLevel 1 0.0964 JobLevel = 1 <lis~ <list [2]>
## 3 NumCompa~ 9 0.0814 4 < NumCompanie~ <lis~ <list [2]>
## 4 StockOpt~ 2 -0.0758 StockOptionLeve~ <lis~ <list [2]>
## 5 Business~ 3 0.0773 BusinessTravel ~ <lis~ <list [2]>
## 6 Distance~ 17 0.0635 13.8 < Distance~ <lis~ <list [2]>
## 7 MonthlyI~ 2437 0.0601 MonthlyIncome <~ <lis~ <list [2]>
## 8 YearsAtC~ 1 -0.0557 YearsAtCompany ~ <lis~ <list [2]>
feature_value is the actual value used. For example, this record indicates the employee has worked at 9 companies and lives 17 miles from work.
kernel_width - R^2 values: 0.3: 0.303; 0.4: 0.294; 1.0: 0.309; 1.5: 0.294
In this case (values may vary in RMD), kernel_width = 1 produced the highest R^2 value.
ncol = 1 since we only have one record to examine - at this time. The plot mirrors the feature_weights in the table above. 4 < NumCompaniesWorked is a label from the continuous binning performed earlier. To examine 20 observations, simply modify slice.
explanation <- test_tbl %>% slice(1:20) %>% select(-Attrition) %>%
lime::explain(explainer = explainer, n_labels = 1, n_features = 8,
n_permutations = 5000, kernel_width = 0.2)
explanation %>% as.tibble()## # A tibble: 160 x 13
## model_type case label label_prob model_r2 model_intercept
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 classific~ 1 No 0.964 0.272 0.655
## 2 classific~ 1 No 0.964 0.272 0.655
## 3 classific~ 1 No 0.964 0.272 0.655
## 4 classific~ 1 No 0.964 0.272 0.655
## 5 classific~ 1 No 0.964 0.272 0.655
## 6 classific~ 1 No 0.964 0.272 0.655
## 7 classific~ 1 No 0.964 0.272 0.655
## 8 classific~ 1 No 0.964 0.272 0.655
## 9 classific~ 2 No 0.891 0.335 0.804
## 10 classific~ 2 No 0.891 0.335 0.804
## # ... with 150 more rows, and 7 more variables: model_prediction <dbl>,
## # feature <chr>, feature_value <dbl>, feature_weight <dbl>,
## # feature_desc <chr>, data <list>, prediction <list>
## [1] 0.239 0.335
Too much information. The plot is unreadable. Use another method - plot_explanations
- The No panel indicates the features that support or contradict attrition. Even when there is support for attrition, other factors balance the employee's experience.
- Cases 5, 11 and 20 are likely to leave since they are in the Yes panel.
- Overtime = Yes is dark red in the No column, contradicting a No response.
- With more than 20 cases, the plot becomes challenging to decipher. The table view might be easier in those cases; you can always dplyr::summarize on the case number to group individual employees.
Extend the Lime plots with theme_tq.
Single case:{#plotlimefeaturetq}
Glue: expressions enclosed by braces will be evaluated as R code. Long strings are broken by line and concatenated together. Leading whitespace and blank lines from the first and last lines are automatically trimmed. Example:
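A quick illustration of these glue behaviors (hypothetical values):

```r
library(glue)

case <- 5
prob <- 0.79

# Braces evaluate R code; the leading whitespace and the blank
# first/last lines are trimmed automatically
glue("
  Case: {case}
  Probability: {round(prob, 2)}
")
```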
library(glue)# Helps with text
plot_features_tq <- function(explanation, ncol) {
data_transformed <- explanation %>% as.tibble() %>%
mutate(
feature_desc = as_factor(feature_desc) %>%
fct_reorder(abs(feature_weight), .desc = FALSE),
key = ifelse(feature_weight > 0, "Supports", "Contradicts") %>%
fct_relevel("Supports"),
case_text = glue("Case: {case}"),
label_text = glue("Label: {label}"),
prob_text = glue("Probability: {round(label_prob, 2)}"),
r2_text = glue("Explanation Fit: {model_r2 %>% round(2)}")) %>%
select(feature_desc, feature_weight, key, case_text:r2_text)
data_transformed %>% ggplot(aes(feature_desc, feature_weight, fill = key)) +
geom_bar(stat = "identity") + coord_flip() + theme_tq() + scale_fill_tq() +
labs(y = "Weight", x = "Feature") + theme(title = element_text(size = 9)) +
facet_wrap(~ case_text + label_text + prob_text + r2_text,
ncol = ncol, scales = "free")}
New function example:
mutate(order = order_1 * 1000 + order_2) is a trick to get one order that preserves the hierarchy of feature (like MonthlyIncome) and feature_value (like 9419) in the explanation tibble.
rank simply ranks values from lowest to highest.
plot_explanations_tq <- function(explanation) {
data_transformed <- explanation %>% as.tibble() %>%
mutate(case = as_factor(case), order_1 = rank(feature)) %>%
group_by(feature) %>% mutate(order_2 = rank(feature_value)) %>%
ungroup() %>% mutate(order = order_1 * 1000 + order_2) %>%
mutate(feature_desc = as.factor(feature_desc) %>% fct_reorder(order, .desc = T)) %>%
select(case, feature_desc, feature_weight, label)
data_transformed %>% ggplot(aes(case, feature_desc)) +
geom_tile(aes(fill = feature_weight)) + facet_wrap(~ label) + theme_tq() +
scale_fill_gradient2(low = palette_light()[[2]], mid = "white", high = palette_light()[[1]]) +
theme(panel.grid = element_blank(), legend.position = "right",
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
labs(y = "Feature", x = "Case", fill = glue("Feature\nWeight"))}
Note the glue in the labs statement above: with the added line break, it stacks the Feature and Weight legend title.
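The hierarchical-order trick can be seen on a tiny hypothetical tibble (toy feature names and values, not from the dataset):

```r
library(dplyr)

# Two features, several values each
toy <- tibble(feature = c("b", "a", "a", "b"), feature_value = c(5, 9, 2, 1))

toy %>%
  mutate(order_1 = rank(feature)) %>%        # order by feature name first
  group_by(feature) %>%
  mutate(order_2 = rank(feature_value)) %>%  # then by value within each feature
  ungroup() %>%
  mutate(order = order_1 * 1000 + order_2)   # one combined sortable key
```

Multiplying the outer rank by 1000 guarantees the within-feature rank can never reorder rows across features.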
This teaches you how to add value to organizations - by providing ROI-Driven Data Science! This is split in two:
Starting on Step 2 of the BSPF Encode Algorithms Step:
This marks the beginning of the CRISP-DM Phase 5: Evaluation.
Using ROI, two solutions will be developed:
Threshold Optimization and Sensitivity Analysis will be developed for optimization.
There are two key terms to understand that support the forthcoming code development:
All costs are annualized
Calculate the expected value of changing all employees to no overtime.
3 Steps:
assess_attrition contains several functions:
calculate_attrition_cost - this is the one that is needed
count_to_pct
plot_attrition
bind_cols adds columns from test_tbl to the three columns provided by automl_leader %>% h2o.predict(newdata = as.h2o(test_tbl)) %>% as.tibble()
predictions_with_OT_tbl <- automl_leader %>% h2o.predict(newdata = as.h2o(test_tbl)) %>% as.tibble() %>%
bind_cols(test_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
predictions_with_OT_tbl %>% head()## # A tibble: 6 x 6
## predict No Yes EmployeeNumber MonthlyIncome OverTime
## <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 No 0.964 0.0359 228 9419 No
## 2 No 0.891 0.109 1278 13577 Yes
## 3 No 0.967 0.0330 1250 17779 No
## 4 No 0.953 0.0471 2065 5390 No
## 5 Yes 0.210 0.790 1767 2437 Yes
## 6 No 0.937 0.0631 1308 2372 No
Recall calculate_attrition_cost:
## function (n = 1, salary = 80000, separation_cost = 500, vacancy_cost = 10000,
## acquisition_cost = 4900, placement_cost = 3500, net_revenue_per_employee = 250000,
## workdays_per_year = 240, workdays_position_open = 40, workdays_onboarding = 60,
## onboarding_efficiency = 0.5)
## {
## direct_cost <- sum(separation_cost, vacancy_cost, acquisition_cost,
## placement_cost)
## productivy_cost <- net_revenue_per_employee/workdays_per_year *
## (workdays_position_open + workdays_onboarding * onboarding_efficiency)
## salary_benefit_reduction <- salary/workdays_per_year * workdays_position_open
## cost_per_employee <- direct_cost + productivy_cost - salary_benefit_reduction
## total_cost <- n * cost_per_employee
## return(total_cost)
## }
Some changes we can make:
Expose net_revenue_per_employee so it can be manipulated.
cost_of_policy_change = 0 because there is no change in the OT policy - keeping everything the same for the baseline.
The Yes column is the likelihood of leaving; when combined with the attrition cost, we get an expected attrition cost (coming soon).
Yes and No are probabilities. Because cost_of_policy_change is 0, \(ExpectedCost = Yes * AttritionCost\).
ev_with_OT_tbl <- predictions_with_OT_tbl %>%
mutate(attrition_cost = calculate_attrition_cost(
n = 1, salary = MonthlyIncome * 12, net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = 0) %>%
mutate(expected_attrition_cost =
Yes * (attrition_cost + cost_of_policy_change) + No * (cost_of_policy_change))
ev_with_OT_tbl## # A tibble: 220 x 9
## predict No Yes EmployeeNumber MonthlyIncome OverTime
## <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 No 0.964 0.0359 228 9419 No
## 2 No 0.891 0.109 1278 13577 Yes
## 3 No 0.967 0.0330 1250 17779 No
## 4 No 0.953 0.0471 2065 5390 No
## 5 Yes 0.210 0.790 1767 2437 Yes
## 6 No 0.937 0.0631 1308 2372 No
## 7 No 0.956 0.0444 18 2661 No
## 8 No 0.960 0.0403 460 6347 No
## 9 No 0.963 0.0370 1369 5363 No
## 10 No 0.959 0.0414 1040 6347 No
## # ... with 210 more rows, and 3 more variables: attrition_cost <dbl>,
## # cost_of_policy_change <dbl>, expected_attrition_cost <dbl>
Now calculate the total attrition cost with OT - the baseline case. This is simple to do with summarize.
total_ev_with_OT_tbl <- ev_with_OT_tbl %>% summarize(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
total_ev_with_OT_tbl## # A tibble: 1 x 1
## total_expected_attrition_cost_0
## <dbl>
## 1 3078803.
Next, modify test_tbl to change all OT Yes to No. fct_recode makes this easy!
test_without_OT_tbl <- test_tbl %>% mutate(OverTime = fct_recode(OverTime, "No" = "Yes"))
test_without_OT_tbl$OverTime## [1] No No No No No No No No No No No No No No No No No No No No No No No
## [24] No No No No No No No No No No No No No No No No No No No No No No No
## [47] No No No No No No No No No No No No No No No No No No No No No No No
## [70] No No No No No No No No No No No No No No No No No No No No No No No
## [93] No No No No No No No No No No No No No No No No No No No No No No No
## [116] No No No No No No No No No No No No No No No No No No No No No No No
## [139] No No No No No No No No No No No No No No No No No No No No No No No
## [162] No No No No No No No No No No No No No No No No No No No No No No No
## [185] No No No No No No No No No No No No No No No No No No No No No No No
## [208] No No No No No No No No No No No No No
## Levels: No
Use test_without_OT_tbl to make new predictions with the same code but a different data file. rename OverTime so the OT values sit side-by-side.
predictions_without_OT_tbl <- automl_leader %>% h2o.predict(newdata = as.h2o(test_without_OT_tbl)) %>%
as.tibble() %>%
bind_cols(
test_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime),
test_without_OT_tbl %>% select(OverTime)) %>%
rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
predictions_without_OT_tbl %>% head()## # A tibble: 6 x 7
## predict No Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 No 0.964 0.0359 228 9419 No No
## 2 No 0.953 0.0465 1278 13577 Yes No
## 3 No 0.967 0.0330 1250 17779 No No
## 4 No 0.953 0.0471 2065 5390 No No
## 5 No 0.793 0.207 1767 2437 Yes No
## 6 No 0.937 0.0631 1308 2372 No No
Note all the values for OverTime_1 are No. The new overtime values in OverTime_1 will have an impact on the expected value calculation.
When changing Yes to No, there is a cost associated with the change:
avg_overtime_pct is a business decision. The goal is to update the calculation to reflect a new policy based on realistic parameters the organization supports.
avg_overtime_pct <- 0.10
ev_without_OT_tbl <- predictions_without_OT_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = case_when(
OverTime_0 =="Yes" & OverTime_1 =="No" ~ avg_overtime_pct * attrition_cost, TRUE ~ 0) ) %>%
mutate(expected_attrition_cost = Yes * (attrition_cost + cost_of_policy_change) + No * (cost_of_policy_change))
ev_without_OT_tbl %>% select(c(MonthlyIncome:expected_attrition_cost))## # A tibble: 220 x 6
## MonthlyIncome OverTime_0 OverTime_1 attrition_cost cost_of_policy_~
## <dbl> <fct> <fct> <dbl> <dbl>
## 1 9419 No No 72979. 0
## 2 13577 Yes No 64663. 6466.
## 3 17779 No No 56259. 0
## 4 5390 No No 81037. 0
## 5 2437 Yes No 86943. 8694.
## 6 2372 No No 87073. 0
## 7 2661 No No 86495. 0
## 8 6347 No No 79123. 0
## 9 5363 No No 81091. 0
## 10 6347 No No 79123. 0
## # ... with 210 more rows, and 1 more variable:
## # expected_attrition_cost <dbl>
total_ev_without_OT_tbl <- ev_without_OT_tbl %>% summarise(
total_expected_attrition_cost_1 = sum(expected_attrition_cost))
total_ev_without_OT_tbl
## # A tibble: 1 x 1
## total_expected_attrition_cost_1
## <dbl>
## 1 2666825.
Recall the total attrition cost for the baseline (with OT) was 3078802.893 - the cost without OT is lower!
bind_cols(total_ev_with_OT_tbl, total_ev_without_OT_tbl) %>%
mutate(
savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
pct_savings = savings / total_expected_attrition_cost_0)
## # A tibble: 1 x 4
## total_expected_attrition_~ total_expected_attrition~ savings pct_savings
## <dbl> <dbl> <dbl> <dbl>
## 1 3078803. 2666825. 411978. 0.134
Eliminating OT offers a potential savings of over $400k. Recall we are examining only 220 records. To annualize this, add the training set's additional 1250 employees:
\((train + test)/test = (1250 + 220) / 220 = 6.7; 6.7 * 400000 = 2,680,000\)
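The annualization arithmetic above can be sketched in base R (6.7 is the rounded scaling factor; 400000 approximates the savings shown earlier):

```r
# Scale the test-set savings to the full employee population (train + test).
n_train <- 1250
n_test  <- 220
scaling_factor <- (n_train + n_test) / n_test   # ~6.68, rounded to 6.7 above
scaling_factor * 400000                          # ~2.67M annualized savings
```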
The previous analysis of the baseline compared to a policy of no OT for everyone. If the OT policy was targeted towards those that are likely to leave, then perhaps greater savings might be realized.
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.485216626479869:
## No Yes Error Rate
## No 181 3 0.016304 =3/184
## Yes 16 20 0.444444 =16/36
## Totals 197 23 0.086364 =19/220
The confusion matrix above is based on the threshold that maximizes F1.
What needs evaluating is how the confusion matrix changes across different threshold values. Typically we seek to reduce false negatives, because they are often more expensive.
h2o.metric returns the classifier performance metrics by threshold.
rates_by_threshold_tbl <- performance_h2o %>% h2o.metric() %>% as.tibble()
rates_by_threshold_tbl %>% glimpse()
## Observations: 220
## Variables: 20
## $ threshold <dbl> 0.966, 0.899, 0.899, 0.844, 0.839, 0.8...
## $ f1 <dbl> 0.0541, 0.1053, 0.1538, 0.2000, 0.2439...
## $ f2 <dbl> 0.0345, 0.0685, 0.1020, 0.1351, 0.1678...
## $ f0point5 <dbl> 0.125, 0.227, 0.312, 0.385, 0.446, 0.5...
## $ accuracy <dbl> 0.841, 0.845, 0.850, 0.855, 0.859, 0.8...
## $ precision <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ recall <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ specificity <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ absolute_mcc <dbl> 0.153, 0.217, 0.266, 0.308, 0.345, 0.3...
## $ min_per_class_accuracy <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ mean_per_class_accuracy <dbl> 0.514, 0.528, 0.542, 0.556, 0.569, 0.5...
## $ tns <dbl> 184, 184, 184, 184, 184, 184, 184, 184...
## $ fns <dbl> 35, 34, 33, 32, 31, 30, 29, 28, 27, 27...
## $ fps <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ tps <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 11, ...
## $ tnr <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ fnr <dbl> 0.972, 0.944, 0.917, 0.889, 0.861, 0.8...
## $ fpr <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ tpr <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ idx <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
Note how the threshold changes other key values.
Only a subset of this data is needed for further analysis.
## # A tibble: 220 x 6
## threshold f1 tnr fnr fpr tpr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 0.0541 1 0.972 0 0.0278
## 2 0.899 0.105 1 0.944 0 0.0556
## 3 0.899 0.154 1 0.917 0 0.0833
## 4 0.844 0.20 1 0.889 0 0.111
## 5 0.839 0.244 1 0.861 0 0.139
## 6 0.816 0.286 1 0.833 0 0.167
## 7 0.815 0.326 1 0.806 0 0.194
## 8 0.812 0.364 1 0.778 0 0.222
## 9 0.790 0.4 1 0.75 0 0.25
## 10 0.773 0.391 0.995 0.75 0.00543 0.25
## # ... with 210 more rows
It is easy to get the values for the max F1.
## # A tibble: 1 x 6
## threshold f1 tnr fnr fpr tpr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.485 0.678 0.984 0.444 0.0163 0.556
(slice makes sure only 1 F1 record is returned - there can be more than one.)
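A minimal illustration (dplyr, assuming it is loaded) of why slice(1) matters - filter(f1 == max(f1)) can return ties; the data below is made up for the example:

```r
library(dplyr)

# Hypothetical metrics table with a tie at the maximum F1
tbl <- tibble(threshold = c(0.4, 0.485, 0.5),
              f1        = c(0.6, 0.678, 0.678))

tbl %>% filter(f1 == max(f1))               # two rows tie at the max
tbl %>% filter(f1 == max(f1)) %>% slice(1)  # slice keeps only the first
```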
Visualize how the critical confusion matrix values vary by threshold
gather goes from wide to long format - required for ggplot groups. factor_key = TRUE keeps the order of tnr:tpr.
fct_reorder2(key, threshold, value) is used when plotting with x and y aesthetics to control the legend order.
The x axis uses threshold and the y axis uses value; key controls the legend.
rates_by_threshold_tbl %>% select(threshold, f1, tnr:tpr) %>%
gather(key = "key", value = "value", tnr:tpr, factor_key = TRUE) %>%
mutate(key = fct_reorder2(key, threshold, value)) %>% ggplot(aes(threshold, value, color = key)) +
geom_point() + geom_smooth() + theme_tq() + scale_color_tq() +
theme(legend.position = "right") + labs(title = "Expected Rates", y = "Value", x = "Threshold")
Note at Threshold = 1, the order of the lines from top to bottom matches the order of the legend!
FPs and FNs result in costs
We can copy code developed before in expected_value_no_OT_policy.R
predictions_with_OT_tbl <- automl_leader %>% h2o.predict(newdata = as.h2o(test_tbl)) %>%
as.tibble() %>% bind_cols(test_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
ev_with_OT_tbl <- predictions_with_OT_tbl %>%
mutate(attrition_cost = calculate_attrition_cost(
n = 1, salary = MonthlyIncome * 12, net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = 0) %>%
mutate(expected_attrition_cost =
Yes * (attrition_cost + cost_of_policy_change) +
No * (cost_of_policy_change))
# Attrition cost occurs only if the employee leaves. The Yes column is the likelihood of leaving.
# Combining them gives the expected attrition cost.
total_ev_with_OT_tbl <- ev_with_OT_tbl %>% summarise(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
total_ev_with_OT_tbl
## # A tibble: 1 x 1
## total_expected_attrition_cost_0
## <dbl>
## 1 3078803.
Review a previous dataframe, rates_by_threshold_tbl:
## # A tibble: 220 x 20
## threshold f1 f2 f0point5 accuracy precision recall specificity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 0.0541 0.0345 0.125 0.841 1 0.0278 1
## 2 0.899 0.105 0.0685 0.227 0.845 1 0.0556 1
## 3 0.899 0.154 0.102 0.312 0.85 1 0.0833 1
## 4 0.844 0.20 0.135 0.385 0.855 1 0.111 1
## 5 0.839 0.244 0.168 0.446 0.859 1 0.139 1
## 6 0.816 0.286 0.200 0.5 0.864 1 0.167 1
## 7 0.815 0.326 0.232 0.547 0.868 1 0.194 1
## 8 0.812 0.364 0.263 0.588 0.873 1 0.222 1
## 9 0.790 0.4 0.294 0.625 0.877 1 0.25 1
## 10 0.773 0.391 0.292 0.592 0.873 0.9 0.25 0.995
## # ... with 210 more rows, and 12 more variables: absolute_mcc <dbl>,
## # min_per_class_accuracy <dbl>, mean_per_class_accuracy <dbl>,
## # tns <dbl>, fns <dbl>, fps <dbl>, tps <dbl>, tnr <dbl>, fnr <dbl>,
## # fpr <dbl>, tpr <dbl>, idx <int>
Get the metrics for the max F1
max_f1_tbl <- rates_by_threshold_tbl %>% select(threshold, f1, tnr:tpr) %>% filter(f1 == max(f1)) %>% slice(1)
max_f1_tbl
## # A tibble: 1 x 6
## threshold f1 tnr fnr fpr tpr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.485 0.678 0.984 0.444 0.0163 0.556
Collect the values for each variable and save to its own variable.
tnr <- max_f1_tbl$tnr
fnr <- max_f1_tbl$fnr
fpr <- max_f1_tbl$fpr
tpr <- max_f1_tbl$tpr
threshold <- max_f1_tbl$threshold
F1 is the harmonic mean of precision and recall; the threshold that maximizes F1 (0.485 here) balances the FPR against the FNR.
Focus on records whose predicted Yes probability meets or exceeds this threshold and change their OT to No. To do this, modify test_tbl: anyone with \(Yes \ge 0.485 \;\&\; OverTime = Yes\) gets \(OverTime = No\). We are selectively toggling OverTime from Yes to No for these high-risk employees.
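A quick base-R check that F1 is the harmonic mean of precision and recall, using the counts from the max-F1 confusion matrix above (tp = 20, fp = 3, fn = 16):

```r
# Counts from the confusion matrix at the max-F1 threshold
tp <- 20; fp <- 3; fn <- 16

precision <- tp / (tp + fp)   # 20/23
recall    <- tp / (tp + fn)   # 20/36

# Harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 3)
## [1] 0.678
```

This matches the f1 value shown for threshold 0.485 in the rates table.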
add_column (from tibble) adds a column using a vector of values.
Yes >= threshold ~ factor("No", levels = levels(test_tbl$OverTime)) keeps the factor levels consistent with test_tbl.
# Anyone with a risk at or above the threshold: convert their OT to No
test_targeted_OT_tbl <- test_tbl %>% add_column(Yes = predictions_with_OT_tbl$Yes) %>%
mutate(OverTime = case_when(
Yes >= threshold ~ factor("No", levels = levels(test_tbl$OverTime)), TRUE ~ OverTime)) %>%
select(-Yes)
test_targeted_OT_tbl
## # A tibble: 220 x 32
## Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 30 No Travel_Rarely 1339 Sales 5
## 2 55 No Non-Travel 177 Research ~ 8
## 3 54 No Travel_Rarely 685 Research ~ 3
## 4 49 No Travel_Freque~ 1023 Sales 2
## 5 43 Yes Travel_Freque~ 807 Research ~ 17
## 6 58 No Travel_Rarely 848 Research ~ 23
## 7 34 No Travel_Rarely 1346 Research ~ 19
## 8 37 No Travel_Rarely 1192 Research ~ 5
## 9 35 No Travel_Rarely 817 Research ~ 1
## 10 50 No Non-Travel 145 Sales 1
## # ... with 210 more rows, and 26 more variables: Education <fct>,
## # EducationField <fct>, EmployeeNumber <dbl>,
## # EnvironmentSatisfaction <fct>, Gender <fct>, HourlyRate <dbl>,
## # JobInvolvement <fct>, JobLevel <fct>, JobRole <fct>,
## # JobSatisfaction <fct>, MaritalStatus <fct>, MonthlyIncome <dbl>,
## # MonthlyRate <dbl>, NumCompaniesWorked <dbl>, OverTime <fct>,
## # PercentSalaryHike <dbl>, PerformanceRating <fct>,
## # RelationshipSatisfaction <fct>, StockOptionLevel <fct>,
## # TotalWorkingYears <dbl>, TrainingTimesLastYear <dbl>,
## # WorkLifeBalance <fct>, YearsAtCompany <dbl>, YearsInCurrentRole <dbl>,
## # YearsSinceLastPromotion <dbl>, YearsWithCurrManager <dbl>
Now that the data is prepared in test_targeted_OT_tbl, it is time to generate predictions.
Running h2o.predict on the new data outputs the class probabilities of Attrition = Yes under the policy changes.
predictions_targeted_OT_tbl <- automl_leader %>%
h2o.predict(newdata = as.h2o(test_targeted_OT_tbl)) %>% as.tibble() %>%
bind_cols(
test_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime),
test_targeted_OT_tbl %>% select(OverTime)) %>%
rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
## # A tibble: 220 x 7
## predict No Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 No 0.964 0.0359 228 9419 No No
## 2 No 0.891 0.109 1278 13577 Yes Yes
## 3 No 0.967 0.0330 1250 17779 No No
## 4 No 0.953 0.0471 2065 5390 No No
## 5 No 0.793 0.207 1767 2437 Yes No
## 6 No 0.937 0.0631 1308 2372 No No
## 7 No 0.956 0.0444 18 2661 No No
## 8 No 0.960 0.0403 460 6347 No No
## 9 No 0.963 0.0370 1369 5363 No No
## 10 No 0.959 0.0414 1040 6347 No No
## # ... with 210 more rows
Before:
## predict No Yes
## 1 No 0.964 0.0359
## 2 No 0.891 0.1090
## 3 No 0.967 0.0330
## 4 No 0.953 0.0471
## 5 Yes 0.210 0.7903
## 6 No 0.937 0.0631
##
## [220 rows x 3 columns]
After:
## predict No Yes
## 1 No 0.964 0.0359
## 2 No 0.891 0.1090
## 3 No 0.967 0.0330
## 4 No 0.953 0.0471
## 5 No 0.793 0.2070
## 6 No 0.937 0.0631
##
## [220 rows x 3 columns]
Record #5's Yes probability dropped from 0.790 to 0.207, and its prediction changed from Yes to No.
Predictions have been made and saved in predictions_targeted_OT_tbl. Next step is to calculate the costs of the new OT targeted policy. Some of the code is familiar and copied from previous work.
cb_tn = cost of policy change. The cost when a targeted employee stays, controlled by the cost-of-policy-change logic.
cb_fp = cost of policy change. The cost of predicting an employee will leave when the employee stays; also controlled by the cost-of-policy-change logic.
cb_tp = cost of policy change + attrition cost. Controlled by the cost-of-policy-change logic AND attrition costs.
cb_fn = cost of policy change + attrition cost. The cost when the employee leaves even though they were predicted to stay. This is the most expensive case.
expected_attrition_cost calculates the overall expected cost by multiplying the Yes and No probabilities by the costs calculated above (cb_**).
Memorize/understand: TN and FP go together (tnr + fpr = 1), and TP and FN go together (tpr + fnr = 1).
It might help to review the Expected Rates plot above. It makes intuitive sense if you think about it.
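The cost-benefit logic can be sketched in base R for a single employee. The rates come from the max-F1 row above; the costs are taken from record 5 of the table for illustration:

```r
# Rates from the max-F1 threshold (note tnr + fpr = 1 and tpr + fnr = 1,
# up to rounding)
tnr <- 0.984; fpr <- 0.0163
tpr <- 0.556; fnr <- 0.444

# Record 5's values from the table above
yes <- 0.207; no <- 1 - yes
attrition_cost        <- 86943
cost_of_policy_change <- 8694

# Cost-benefit values per confusion-matrix outcome
cb_tn <- cost_of_policy_change
cb_fp <- cost_of_policy_change
cb_tp <- cost_of_policy_change + attrition_cost
cb_fn <- cost_of_policy_change + attrition_cost

# Probability-weighted expected cost
expected_attrition_cost <-
  yes * (tpr * cb_tp + fnr * cb_fn) + no * (tnr * cb_tn + fpr * cb_fp)
expected_attrition_cost
```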
avg_overtime_pct <- 0.10 # a business decision
ev_targeted_OT_tbl <- predictions_targeted_OT_tbl %>%
mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = case_when(
OverTime_0 == "Yes" & OverTime_1 == "No" ~ attrition_cost * avg_overtime_pct, TRUE ~ 0)) %>%
mutate(
cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
cb_tp = cost_of_policy_change + attrition_cost, cb_fn = cost_of_policy_change + attrition_cost,
expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
ev_targeted_OT_tbl
## # A tibble: 220 x 14
## predict No Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
## <fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 No 0.964 0.0359 228 9419 No No
## 2 No 0.891 0.109 1278 13577 Yes Yes
## 3 No 0.967 0.0330 1250 17779 No No
## 4 No 0.953 0.0471 2065 5390 No No
## 5 No 0.793 0.207 1767 2437 Yes No
## 6 No 0.937 0.0631 1308 2372 No No
## 7 No 0.956 0.0444 18 2661 No No
## 8 No 0.960 0.0403 460 6347 No No
## 9 No 0.963 0.0370 1369 5363 No No
## 10 No 0.959 0.0414 1040 6347 No No
## # ... with 210 more rows, and 7 more variables: attrition_cost <dbl>,
## # cost_of_policy_change <dbl>, cb_tn <dbl>, cb_fp <dbl>, cb_tp <dbl>,
## # cb_fn <dbl>, expected_attrition_cost <dbl>
Review the table above with a few examples:
1st record: ~4% Yes probability and no overtime, so nothing to adjust. Because the policy change is not a factor, cb_tn and cb_fp are 0. If the employee does leave (TP, FN), the cost-benefit values equal the attrition cost. The expected attrition cost is low because the chance of leaving is low and the employee is not targeted.
The policy logic only kicks in when a record's Yes probability meets the threshold (0.485). If someone is working overtime but the threshold is not met, the overtime stays Yes. Compare record 2 to record 5: record 2 was not toggled, so if that employee were to leave, the full attrition cost of 64662.667 would be incurred rather than the probability-weighted expected attrition cost.
The final step is to calculate the final expected value.
total_ev_targeted_OT_tbl <- ev_targeted_OT_tbl %>%
summarize(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
total_ev_targeted_OT_tbl
## # A tibble: 1 x 1
## total_expected_attrition_cost_1
## <dbl>
## 1 2683388.
savings_tbl <- bind_cols(total_ev_with_OT_tbl, total_ev_targeted_OT_tbl) %>%
mutate(savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
pct_savings = savings / total_expected_attrition_cost_0)
savings_tbl
## # A tibble: 1 x 4
## total_expected_attrition_~ total_expected_attrition~ savings pct_savings
## <dbl> <dbl> <dbl> <dbl>
## 1 3078803. 2683388. 395415. 0.128
Using the targeted OT policy saves ~$395k. Recall the targeted OT policy is based on the max F1 threshold. The next question: does the max F1 threshold provide the greatest savings?
Is there a better threshold that can provide more savings than the max F1 threshold does? We should always check whether a more optimal threshold exists.
This requires 2 steps:
calculate_savings_by_threshold calculates and returns the savings when the user provides data, h2o model, the threshold and the expected rates (tnr, fpr, fnr, tpr).
The code below is familiar; this time it is combined into a single function that returns a single value - the savings.
bind_cols is a dplyr function that does what it says: the predict, No, and Yes columns returned by h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() are bound with the three noted features from the original data and saved as pred_0_tbl.
ev_0_tbl takes pred_0_tbl with the predictions and adds the features covered before.
With dplyr summarise code, the expected value is saved in total_ev_0_tbl, which simply provides the total expected attrition cost.
data_1_tbl gets a new column Yes, used to overwrite the OverTime feature from Yes to No.
calculate_savings_by_threshold <- function(data, h2o_model, threshold = 0,
tnr = 0, fpr = 1, fnr = 0, tpr = 1) {
data_0_tbl <- as.tibble(data)
# Calculating Expected Value With OT
pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
ev_0_tbl <- pred_0_tbl %>%
mutate(attrition_cost = calculate_attrition_cost(n = 1,salary = MonthlyIncome * 12,
net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = 0) %>%
mutate(expected_attrition_cost =
Yes * (attrition_cost + cost_of_policy_change) + No * (cost_of_policy_change))
total_ev_0_tbl <- ev_0_tbl %>% summarise(
total_expected_attrition_cost_0 = sum(expected_attrition_cost))
#Calculating Expected Value With Targeted OT
data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
mutate( OverTime = case_when(
Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)),
TRUE ~ OverTime)) %>% select(-Yes)
pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
bind_cols(data_0_tbl %>%
select(EmployeeNumber, MonthlyIncome, OverTime), data_1_tbl %>%
select(OverTime)) %>% rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
avg_overtime_pct <- 0.10
ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1,
salary = MonthlyIncome * 12,
net_revenue_per_employee = 250000)) %>%
mutate(cost_of_policy_change = case_when(
OverTime_1 == "No" & OverTime_0 == "Yes" ~ attrition_cost * avg_overtime_pct,
TRUE ~ 0))%>%
mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
cb_fn = attrition_cost + cost_of_policy_change,
cb_tp = attrition_cost + cost_of_policy_change,
expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
total_ev_1_tbl <- ev_1_tbl %>%
summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
#Savings Calculation
savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
mutate(
savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
pct_savings = savings / total_expected_attrition_cost_0)
return(savings_tbl$savings)}
It is helpful to recall the Expected Rates plot above to understand the values from the confusion matrix.
With threshold = 0, everyone working overtime is toggled from Yes to No:
## [1] 411978
So the expected savings for a no-OT policy is ~$412k.
With threshold = 1, no one is toggled from Yes to No:
## [1] 0
A savings of $0 makes sense because a threshold = 1 is basically not changing anything. So if there are no changes, there are no savings.
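The hidden calls that produced those two outputs were likely of this form - a sketch assuming calculate_savings_by_threshold, test_tbl, and automl_leader are defined as above:

```r
# threshold = 0 targets everyone (the no-OT policy); with everyone predicted
# to leave, tpr = fpr = 1 and tnr = fnr = 0 (the function defaults)
calculate_savings_by_threshold(test_tbl, automl_leader, threshold = 0,
                               tnr = 0, fpr = 1, fnr = 0, tpr = 1)

# threshold = 1 targets no one (do nothing); with no one predicted to leave,
# tnr = fnr = 1 and tpr = fpr = 0
calculate_savings_by_threshold(test_tbl, automl_leader, threshold = 1,
                               tnr = 1, fpr = 0, fnr = 1, tpr = 0)
```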
rates_by_threshold_tbl %>% select(threshold, f1, tnr:tpr) %>% filter(f1 == max(f1))
max_f1_savings <- calculate_savings_by_threshold(test_tbl, automl_leader, threshold = max_f1_tbl$threshold,
tnr = max_f1_tbl$tnr, fpr = max_f1_tbl$fpr, fnr = max_f1_tbl$fnr, tpr = max_f1_tbl$tpr)
## [1] 395415
Time to now iteratively discover the optimal threshold to find the max savings. Will reuse parts of rates_by_threshold_tbl:
## # A tibble: 6 x 5
## threshold tnr fnr fpr tpr
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 1 0.972 0 0.0278
## 2 0.899 1 0.944 0 0.0556
## 3 0.899 1 0.917 0 0.0833
## 4 0.844 1 0.889 0 0.111
## 5 0.839 1 0.861 0 0.139
## 6 0.816 1 0.833 0 0.167
seq is used to sample a portion of the thresholds.
length.out = 20, combined with round(digits = 0), selects 20 whole-number row indices between 1 and 220.
pmap_dbl: when mutate is used with a purrr map function, row-wise mapping is possible for iteration. {#secondMap}
With map alone, column-wise mapping is provided. {#firstMap}
pmap functions take a list input (.l), which allows any number of inputs to be mapped.
.l is required; it stores a list that maps the arguments to the columns in the data frame.
pmap_dbl returns a numeric vector, as opposed to pmap, which returns a list by default.
purrr's pmap functions are variants of map() that iterate over multiple arguments in parallel.
threshold, tnr, fnr, fpr, tpr are all used to calculate savings. calculate_savings_by_threshold has the following arguments:
function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1)
.f is a required argument just like .l. partial allows you to preload a function with arguments that do not change. {#partial} In calculate_savings_by_threshold, the arguments data and h2o_model do not change - very useful for mapping when arguments never change during the iterations.
smpl <- seq(1, 220, length.out = 20) %>% round(digits = 0)
rates_by_threshold_optimized_tbl <- rates_by_threshold_tbl %>%
select(threshold, tnr:tpr) %>%
slice(smpl) %>%
mutate(
savings = pmap_dbl(
.l = list(
threshold = threshold,
tnr = tnr,
fnr = fnr,
fpr = fpr,
tpr = tpr ),
.f = partial(calculate_savings_by_threshold, data = test_tbl, h2o_model = automl_leader)))
## # A tibble: 20 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 1 0.972 0 0.0278 11918.
## 2 0.737 0.995 0.667 0.00543 0.333 249517.
## 3 0.450 0.978 0.444 0.0217 0.556 417371.
## 4 0.315 0.929 0.361 0.0707 0.639 456590.
## 5 0.225 0.891 0.25 0.109 0.75 504173.
## 6 0.181 0.826 0.25 0.174 0.75 539867.
## 7 0.128 0.772 0.222 0.228 0.778 538986.
## 8 0.113 0.712 0.194 0.288 0.806 529153.
## 9 0.0930 0.652 0.194 0.348 0.806 517312.
## 10 0.0737 0.598 0.139 0.402 0.861 513145.
## 11 0.0648 0.543 0.111 0.457 0.889 490610.
## 12 0.0570 0.478 0.111 0.522 0.889 477937.
## 13 0.0513 0.424 0.0833 0.576 0.917 464770.
## 14 0.0471 0.359 0.0833 0.641 0.917 437500.
## 15 0.0431 0.310 0.0278 0.690 0.972 429991.
## 16 0.0403 0.25 0 0.75 1 417023.
## 17 0.0387 0.190 0 0.810 1 417023.
## 18 0.0370 0.125 0 0.875 1 417023.
## 19 0.0345 0.0652 0 0.935 1 411978.
## 20 0.0312 0 0 1 1 411978.
After using the purrr functions, the data is available for plotting. {#expand_limits} {#scale_x_continuous}
Max F1 does not provide the maximum savings!
Added this to ensure the title is accurate
myPctMax <- rates_by_threshold_optimized_tbl %>% filter(savings == max(savings)) %>% select(threshold)
myPctMax <- scales::percent(as.numeric(myPctMax))
rates_by_threshold_optimized_tbl %>% ggplot(aes(threshold, savings)) +
geom_line(color = palette_light()[[1]]) + geom_point(color = palette_light()[[1]]) +
# Optimal Point
geom_point(shape = 21, size = 5, color = palette_light()[[3]],
data = rates_by_threshold_optimized_tbl %>% filter(savings == max(savings))) +
geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[3]],
data = rates_by_threshold_optimized_tbl %>% filter(savings == max(savings))) +
# F1 Max
geom_vline(xintercept = max_f1_tbl$threshold, color = palette_light()[[5]], size = 2) +
annotate(geom = "label", label = scales::dollar(max_f1_savings),
x = max_f1_tbl$threshold, y = max_f1_savings, vjust = -1, color = palette_light()[[1]]) +
# No OT Policy
geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl %>%
filter(threshold == min(threshold))) +
geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[2]],
data = rates_by_threshold_optimized_tbl %>% filter(threshold == min(threshold))) +
# Do Nothing Policy
geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl %>%
filter(threshold == max(threshold))) +
geom_label(aes(label = scales::dollar(round(savings, 0))), vjust = -1, color = palette_light()[[2]],
data = rates_by_threshold_optimized_tbl %>% filter(threshold == max(threshold))) +
# Aesthestics
theme_tq() + expand_limits(x = c(-.1, 1.1), y = 8e5) +
scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, by = 0.2)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = glue("Optimization Results: Expected Savings Maximized At {myPctMax}"),
x = "Threshold (%)", y = "Savings")
## # A tibble: 1 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0312 0 0 1 1 411978.
This policy will reduce turnover, particularly with the higher OT employees.
Recall this is just for a sample of the population - the savings organizationally are much larger
## # A tibble: 1 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.181 0.826 0.25 0.174 0.75 539867.
When FNs cost more than FPs, the optimal threshold will be lower than the max F1 threshold.
In this case, reduce the threshold from the max F1 (48.5%) to 18.053%.
The calculation will be close but not exact. Recall there are only 20 records in rates_by_threshold_optimized_tbl so the closest record is used for the table below.
#http://r.789695.n4.nabble.com/Find-the-closest-value-in-a-list-or-matrix-td838131.html
#which(abs(x-your.number)==min(abs(x-your.number)))
x <- rates_by_threshold_optimized_tbl$threshold
your.number <- max_f1_tbl$threshold
tmp_row_index <- which(abs(x-your.number)==min(abs(x-your.number)))
rates_by_threshold_optimized_tbl[tmp_row_index, ]
## # A tibble: 1 x 6
##   threshold   tnr   fnr    fpr   tpr savings
##       <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>
## 1     0.450 0.978 0.444 0.0217 0.556 417371.
## # A tibble: 1 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 1 0.972 0 0.0278 11918.
Identifying the threshold that maximizes savings is good, but imperfect. Assumptions have been made.
You will never have perfect information; therefore, assumptions must be made.
Even though you must make assumptions, you can still try different values to test your assumptions. This is why you do sensitivity analysis.
Next step is to understand the impact these assumptions have on expected savings. Develop a profitability heatmap illustrating the range of values including best, worst and most likely cases and effect on expected savings.
Will create a function then use purrr to iterate through the function.
calculate_savings_by_threshold_2 is a variant of calculate_savings_by_threshold that has avg_overtime_pct and net_revenue_per_employee as inputs. There are only a few changes. . .
avg_overtime_pct = 0.10 and net_revenue_per_employee = 250000 are the new arguments with default values; the confusion matrix rate defaults are unchanged.
net_revenue_per_employee = net_revenue_per_employee is added to pass the argument through to calculate_attrition_cost.
~ attrition_cost * avg_overtime_pct is added so the overtime percentage uses the argument instead of a hard-coded value.
### calculate_savings_by_thresh_2 {#calculate_savings_by_thresh_2}
calculate_savings_by_threshold_2 <- function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1,
avg_overtime_pct = 0.10, net_revenue_per_employee = 250000) {
data_0_tbl <- as.tibble(data)
# 4.1 Calculating Expected Value With OT
pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
ev_0_tbl <- pred_0_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
# Changed in _2 ----
net_revenue_per_employee = net_revenue_per_employee) ) %>%
mutate(cost_of_policy_change = 0) %>%
mutate(expected_attrition_cost = Yes * (attrition_cost + cost_of_policy_change) +
No * (cost_of_policy_change))
total_ev_0_tbl <- ev_0_tbl %>%
summarise(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
# 4.2 Calculating Expected Value With Targeted OT
data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
mutate(OverTime = case_when(Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)),
TRUE ~ OverTime)) %>% select(-Yes)
pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime), data_1_tbl %>% select(OverTime)) %>%
rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
# Changed in _2 ----
net_revenue_per_employee = net_revenue_per_employee)) %>%
mutate(cost_of_policy_change = case_when(OverTime_1 == "No" & OverTime_0 == "Yes"
~ attrition_cost * avg_overtime_pct, TRUE ~ 0))%>%
mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
cb_fn = attrition_cost + cost_of_policy_change, cb_tp = attrition_cost + cost_of_policy_change,
expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
total_ev_1_tbl <- ev_1_tbl %>%
summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
# 4.3 Savings Calculation
savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
mutate(savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
pct_savings = savings / total_expected_attrition_cost_0)
return(savings_tbl$savings)}
Time to iterate through calculate_savings_by_threshold_2.
First, use the threshold percentage that optimizes savings for all the iterations. We developed this before. This will be used to set the values in the sensitivity analysis.
max_savings_rates_tbl <- rates_by_threshold_optimized_tbl %>% filter(savings == max(savings))
max_savings_rates_tbl
## # A tibble: 1 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.181 0.826 0.25 0.174 0.75 539867.
calculate_savings_by_threshold_2(
data = test_tbl,
h2o_model = automl_leader,
threshold = max_savings_rates_tbl$threshold,
tnr = max_savings_rates_tbl$tnr,
fnr = max_savings_rates_tbl$fnr,
fpr = max_savings_rates_tbl$fpr,
tpr = max_savings_rates_tbl$tpr)
## [1] 539867
Because many of the values will be static, build a partial function using the code above as a template. This is so powerful and yet so simple. Gotta love partial.
calculate_savings_by_threshold_2_preloaded <- partial(
calculate_savings_by_threshold_2,
# Function Arguments
data = test_tbl,
h2o_model = automl_leader,
threshold = max_savings_rates_tbl$threshold,
tnr = max_savings_rates_tbl$tnr,
fnr = max_savings_rates_tbl$fnr,
fpr = max_savings_rates_tbl$fpr,
tpr = max_savings_rates_tbl$tpr)
Test the partial function. This is really helpful - it simplifies our function so we only need to supply the parameters we iterate over in the sensitivity analysis.
## [1] 539867
It works - the same value was returned. The goal is to understand how this value changes with revenue per employee and overtime.
seq builds the ranges of values to test.
cross_df from purrr is really helpful. From a list of variables, it produces all combinations of the list. Super useful for grid search and sensitivity analysis. A sample is provided below. {#cross_df}
cross_df can work with as many variables as you want!
pmap_dbl is used again to map a function by rows - this time the rows are avg_overtime_pct and net_revenue_per_employee.
map2 or map2_dbl would also work; however, they accept only 2 variable mappings, so pmap is more flexible.
pmap takes 2 arguments: .l and .f. The .l argument gets its values from the cross_df output, which has all the combinations of values.
sample_n(list(avg_overtime_pct = seq(0.05, 0.30, by = 0.05),
net_revenue_per_employee = seq(200000, 400000, by = 50000)) %>% cross_df(), 10) %>%
arrange(avg_overtime_pct, net_revenue_per_employee)## # A tibble: 10 x 2
## avg_overtime_pct net_revenue_per_employee
## <dbl> <dbl>
## 1 0.05 200000
## 2 0.05 250000
## 3 0.1 200000
## 4 0.1 350000
## 5 0.1 400000
## 6 0.2 400000
## 7 0.25 250000
## 8 0.3 250000
## 9 0.3 300000
## 10 0.3 350000
sensitivity_tbl <- list(
avg_overtime_pct = seq(0.05, 0.30, by = 0.05),
net_revenue_per_employee = seq(200000, 400000, by = 50000)) %>%
cross_df() %>% mutate(savings = pmap_dbl(
.l = list(avg_overtime_pct = avg_overtime_pct,
net_revenue_per_employee = net_revenue_per_employee),
.f = calculate_savings_by_threshold_2_preloaded))
## # A tibble: 6 x 3
## avg_overtime_pct net_revenue_per_employee savings
## <dbl> <dbl> <dbl>
## 1 0.05 200000 552741.
## 2 0.1 200000 446819.
## 3 0.15 200000 340896.
## 4 0.2 200000 234974.
## 5 0.25 200000 129051.
## 6 0.3 200000 23128.
Now the data is available to plot!
Heat maps are a good visualization to show how 2 variables interact with a 3rd variable (the target)
scale_fill_gradient2 uses 2 gradient transitions by specifying low, mid and high arguments along with a midpoint to specify the middle transition point
scale_fill_gradient offers only low and high points.
scale_fill_gradientn allows any number of gradient colors.
scale_x_continuous - use the same breaks as earlier: seq(0.05, 0.30, by = 0.05).
sensitivity_tbl %>%
ggplot(aes(avg_overtime_pct, net_revenue_per_employee)) + geom_tile(aes(fill = savings)) +
geom_label(aes(label = savings %>% round(0) %>% scales::dollar())) + theme_tq() + theme(legend.position = "none") +
scale_fill_gradient2(low = palette_light()[[2]], mid = "white", high = palette_light()[[1]], midpoint = 0) +
scale_x_continuous(labels = scales::percent, breaks = seq(0.05, 0.30, by = 0.05)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Profitability Heatmap: Expected Savings Sensitivity Analysis",
subtitle = "How sensitive is savings to net revenue per employee and average overtime percentage?",
x = "Average Overtime Percentage", y = "Net Revenue Per Employee")
If you want more color variation, change the midpoint value. Try midpoint = mean(sensitivity_tbl$savings):
Understanding the profitability heatmap.
Enhance the sensitivity analysis by adding stock options into the policy change simultaneously with OT reduction.
Will assume these values:
Changes in the code below to support Stock Options
data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel)) to capture the stock info
mutate(StockOptionLevel = case_when(Yes >= threshold & StockOptionLevel == 0 ~ factor("1", levels = levels(data_0_tbl$StockOptionLevel)), TRUE ~ StockOptionLevel)) to capture when the stock option changes from No to Yes
bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel), data_1_tbl %>% select(OverTime, StockOptionLevel)) to include stock option in the prediction table
StockOptionLevel_0 = StockOptionLevel, StockOptionLevel_1 = StockOptionLevel1 to rename the stock option columns to match the OverTime feature naming
mutate(cost_SO = case_when(StockOptionLevel_1 == "1" & StockOptionLevel_0 == "0" ~ stock_option_cost, TRUE ~ 0)) to capture the stock option cost when a change in options is found
data <- test_tbl
h2o_model <- automl_leader
threshold <- 0; tnr <- 0; fpr <- 1; fnr <- 0; tpr <- 1
avg_overtime_pct <- 0.10
net_revenue_per_employee <- 250000
stock_option_cost <- 5000
calculate_savings_by_threshold_3 <- function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1,
avg_overtime_pct = 0.10, net_revenue_per_employee = 250000,
stock_option_cost = 5000){
data_0_tbl <- as.tibble(data)
#Calculating Expected Value With OT
pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
bind_cols(
# Changed in _3 Select StockOptionLevel Col----
data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel))
ev_0_tbl <- pred_0_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
# Changed in _2 ----
net_revenue_per_employee = net_revenue_per_employee)) %>% mutate(cost_of_policy_change = 0) %>%
mutate(expected_attrition_cost = Yes * (attrition_cost + cost_of_policy_change) + No * (cost_of_policy_change))
total_ev_0_tbl <- ev_0_tbl %>% summarise(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
#Calculating Expected Value With Targeted OT & Stock Option Policy
data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
mutate(OverTime = case_when(Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)),
TRUE ~ OverTime)) %>%
# Changed in _3 Create StockOption Var----
mutate(StockOptionLevel = case_when(Yes >= threshold & StockOptionLevel == 0
~ factor("1", levels = levels(data_0_tbl$StockOptionLevel)),
TRUE ~ StockOptionLevel)) %>% select(-Yes)
pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
# Changed in _3 Bind StockOption Col----
bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel),
data_1_tbl %>% select(OverTime, StockOptionLevel)) %>%
rename(OverTime_0 = OverTime, OverTime_1 = OverTime1,
# Changed in _3 Rename StockOptionLevel----
StockOptionLevel_0 = StockOptionLevel, StockOptionLevel_1 = StockOptionLevel1)
ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
# Changed in _2 ----
net_revenue_per_employee = net_revenue_per_employee)) %>%
# cost_OT
mutate(cost_OT = case_when(OverTime_1 == "No" & OverTime_0 == "Yes" ~ avg_overtime_pct * MonthlyIncome * 12, TRUE ~ 0)) %>%
# Changed in _3 Cost Stock----
mutate(cost_SO = case_when(StockOptionLevel_1 == "1" & StockOptionLevel_0 == "0" ~ stock_option_cost, TRUE ~ 0)) %>%
mutate(cost_of_policy_change = cost_OT + cost_SO) %>%
mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
cb_fn = attrition_cost + cost_of_policy_change, cb_tp = attrition_cost + cost_of_policy_change,
expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
total_ev_1_tbl <- ev_1_tbl %>% summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
#Savings Calculation
savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
mutate(savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
pct_savings = savings / total_expected_attrition_cost_0)
return(savings_tbl$savings)}
Calculate the savings between doing nothing and max F1:
max_f1_tbl <- rates_by_threshold_tbl %>% select(threshold, f1, tnr:tpr) %>% filter(f1 == max(f1))
max_f1_savings <- calculate_savings_by_threshold_3(
test_tbl, automl_leader, threshold = max_f1_tbl$threshold,
tnr = max_f1_tbl$tnr, fpr = max_f1_tbl$fpr, fnr = max_f1_tbl$fnr, tpr = max_f1_tbl$tpr,
avg_overtime_pct = 0.10, net_revenue_per_employee = 250000, stock_option_cost = 5000)
## [1] 809316
smpl <- seq(1, 220, length.out = 20) %>% round(digits = 0)
calculate_savings_by_threshold_3_preloaded <- partial(calculate_savings_by_threshold_3,
data = test_tbl, h2o_model = automl_leader, avg_overtime_pct = 0.10,
net_revenue_per_employee = 250000, stock_option_cost = 5000)
rates_by_threshold_optimized_tbl_3 <- rates_by_threshold_tbl %>% select(threshold, tnr:tpr) %>%
slice(smpl) %>% mutate(savings = pmap_dbl(.l = list(threshold = threshold, tnr = tnr, fnr = fnr,
fpr = fpr, tpr = tpr),
.f = calculate_savings_by_threshold_3_preloaded))
## # A tibble: 15 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.966 1 0.972 0 0.0278 51241.
## 2 0.737 0.995 0.667 0.00543 0.333 517721.
## 3 0.450 0.978 0.444 0.0217 0.556 838874.
## 4 0.315 0.929 0.361 0.0707 0.639 1026488.
## 5 0.225 0.891 0.25 0.109 0.75 1097668.
## 6 0.181 0.826 0.25 0.174 0.75 1131973.
## 7 0.128 0.772 0.222 0.228 0.778 1125033.
## 8 0.113 0.712 0.194 0.288 0.806 1122143.
## 9 0.0930 0.652 0.194 0.348 0.806 1079308.
## 10 0.0737 0.598 0.139 0.402 0.861 1056028.
## 11 0.0648 0.543 0.111 0.457 0.889 1014063.
## 12 0.0570 0.478 0.111 0.522 0.889 991530.
## 13 0.0513 0.424 0.0833 0.576 0.917 963847.
## 14 0.0471 0.359 0.0833 0.641 0.917 905913.
## 15 0.0431 0.310 0.0278 0.690 0.972 892787.
## # A tibble: 1 x 6
## threshold tnr fnr fpr tpr savings
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.181 0.826 0.25 0.174 0.75 1131973.
rates_by_threshold_optimized_tbl_3 %>% ggplot(aes(threshold, savings)) +
# Vlines
geom_vline(xintercept = max_f1_tbl$threshold, color = palette_light()[[5]], size = 2) +
geom_vline(aes(xintercept = threshold), color = palette_light()[[3]], size = 2,
data = rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings))) +
# Points
geom_line(color = palette_light()[[1]]) + geom_point(color = palette_light()[[1]]) +
# F1 Max
annotate(geom = "label", label = scales::dollar(max_f1_savings),
x = max_f1_tbl$threshold, y = max_f1_savings, vjust = -1, color = palette_light()[[1]]) +
# Optimal Point
geom_point(shape = 21, size = 5, color = palette_light()[[3]], data = rates_by_threshold_optimized_tbl_3 %>%
filter(savings == max(savings))) +
geom_label(aes(label = scales::dollar(savings)), vjust = -2, color = palette_light()[[3]],
data = rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings))) +
# No OT Policy
geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl_3 %>%
filter(threshold == min(threshold))) +
geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[2]],
data = rates_by_threshold_optimized_tbl_3 %>% filter(threshold == min(threshold))) +
# Do Nothing Policy
geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl_3 %>%
filter(threshold == max(threshold))) +
geom_label(aes(label = scales::dollar(round(savings, 0))), vjust = -1, color = palette_light()[[2]],
data = rates_by_threshold_optimized_tbl_3 %>% filter(threshold == max(threshold))) +
# Aesthetics
theme_tq() + expand_limits(x = c(-.1, 1.1), y = 12e5) +
scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, by = 0.2)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = glue::glue("Optimization Results: Expected Savings Maximized at {rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings)) %>% pull(threshold) %>% scales::percent()}"),
x = "Threshold (%)", y = "Savings")
#Perform sensitivity analysis at optimal threshold ----
net_revenue_per_employee <- 250000
avg_overtime_pct <- seq(0.05, 0.30, by = 0.05)
stock_option_cost <- seq(5000, 25000, by = 5000)
max_savings_rates_tbl_3 <- rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings))
calculate_savings_by_threshold_3_preloaded <- partial(calculate_savings_by_threshold_3,data = test_tbl,
h2o_model = automl_leader,
threshold = max_savings_rates_tbl_3$threshold,
tnr = max_savings_rates_tbl_3$tnr,
fnr = max_savings_rates_tbl_3$fnr,
fpr = max_savings_rates_tbl_3$fpr,
tpr = max_savings_rates_tbl_3$tpr)
calculate_savings_by_threshold_3_preloaded(avg_overtime_pct = 0.10,
net_revenue_per_employee = 250000, stock_option_cost = 5000)
sensitivity_tbl_3 <- list(avg_overtime_pct = seq(0.05, 0.30, by = 0.05), net_revenue_per_employee = 250000,
stock_option_cost = seq(5000, 25000, by = 5000)) %>% cross_df() %>%
mutate(savings = pmap_dbl(.l = list(avg_overtime_pct = avg_overtime_pct,
net_revenue_per_employee = net_revenue_per_employee,
stock_option_cost = stock_option_cost),
.f = calculate_savings_by_threshold_3_preloaded))
## # A tibble: 15 x 4
## avg_overtime_pct net_revenue_per_employee stock_option_cost savings
## <dbl> <dbl> <dbl> <dbl>
## 1 0.05 250000 25000 334708.
## 2 0.1 250000 5000 1131973.
## 3 0.1 250000 20000 471973.
## 4 0.1 250000 25000 251973.
## 5 0.15 250000 5000 1049239.
## 6 0.15 250000 10000 829239.
## 7 0.15 250000 20000 389239.
## 8 0.15 250000 25000 169239.
## 9 0.2 250000 15000 526504.
## 10 0.2 250000 20000 306504.
## 11 0.25 250000 15000 443769.
## 12 0.25 250000 20000 223769.
## 13 0.3 250000 5000 801035.
## 14 0.3 250000 10000 581035.
## 15 0.3 250000 25000 -78965.
The stock option policy applies only if an employee's likelihood to leave exceeds the threshold. Recall Stock Option Cost is the value required to move an employee from level 0 (no stock options) to level 1.
Breakeven is where the row and column transition from negative to positive.
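A small sketch of that breakeven lookup, assuming sensitivity_tbl_3 from the code above: for each overtime percentage, keep the largest stock option cost that still yields non-negative savings.

```r
library(dplyr)

sensitivity_tbl_3 %>%
  filter(savings >= 0) %>%
  group_by(avg_overtime_pct) %>%
  # Largest stock option cost that remains profitable at this OT level
  summarise(max_viable_stock_option_cost = max(stock_option_cost)) %>%
  ungroup()
```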
sensitivity_tbl_3 %>% ggplot(aes(avg_overtime_pct, stock_option_cost)) + geom_tile(aes(fill = savings)) +
geom_label(aes(label = savings %>% round(0) %>% scales::dollar())) + theme_tq() +
theme(legend.position = "none") +
scale_fill_gradient2(low = palette_light()[[2]], mid = "white",
high = palette_light()[[1]], midpoint = 0) +
scale_x_continuous(labels = scales::percent, breaks = seq(0.05, 0.30, by = 0.05)) +
scale_y_continuous(labels = scales::dollar, breaks = seq(5000, 25000, by = 5000)) +
labs(title = "Profitability Heatmap: Expected Savings Sensitivity Analysis",
subtitle = "How sensitive is savings to stock option cost and average overtime percentage?",
x = "Average Overtime Percentage", y = "Average Stock Option Cost")
3 step process:
recommend_strategies() that outputs strategies by employee. The recommendation algorithm supports Step 3 under Encode Algorithms.
https://tidymodels.github.io/recipes/articles/Ordering.html
Recall recipes was used in Data Preparation when data_preparation_part_2_machine_readable.R was built. (#startRecipes) While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:
The one_hot = TRUE option of recipes::step_dummy provides a column for every category when dummying variables, versus the default of one less column than the number of categories. This is very beneficial for correlation analysis interpretability. (It does give some warnings - OK in this case b/c there is not enough data for breaks)
Recall what the training data looks like - it has been a while!
## Observations: 1,250
## Variables: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <fct> College, Below College, College, Mast...
## $ EducationField <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction <fct> Medium, High, Very High, Very High, L...
## $ Gender <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
JobLevel and StockOptionLevel are really factors.
Build the recipe for the training data and glimpse the result.
step_dummy(all_nominal(), one_hot = TRUE) dummies all categorical data. step_discretize(all_numeric(), options = list(min_unique = 1)) bins all numeric data. bake() applies the recipe to the data.
recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
step_zv(all_predictors()) %>%
step_num2factor(factor_names) %>%
step_discretize(all_numeric(), options = list(min_unique = 1)) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
prep()
train_corr_tbl <- bake(recipe_obj, new = train_readable_tbl)
train_corr_tbl %>% glimpse()
## Observations: 1,250
## Variables: 141
## $ Age_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Age_bin1 <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ Age_bin2 <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0,...
## $ Age_bin3 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ Age_bin4 <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ BusinessTravel_Non.Travel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BusinessTravel_Travel_Rarely <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ BusinessTravel_Travel_Frequently <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ DailyRate_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DailyRate_bin1 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1,...
## $ DailyRate_bin2 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ DailyRate_bin3 <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ DailyRate_bin4 <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Department_Human.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Department_Research...Development <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Department_Sales <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin1 <dbl> 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ DistanceFromHome_bin2 <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ DistanceFromHome_bin3 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ Education_Below.College <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0,...
## $ Education_College <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ Education_Bachelor <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,...
## $ Education_Master <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ Education_Doctor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Human.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Life.Sciences <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1,...
## $ EducationField_Marketing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Medical <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0,...
## $ EducationField_Other <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Technical.Degree <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ EmployeeNumber_bin2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_Low <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_Medium <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_High <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ EnvironmentSatisfaction_Very.High <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1,...
## $ Gender_Female <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Gender_Male <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1,...
## $ HourlyRate_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HourlyRate_bin1 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ HourlyRate_bin2 <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ HourlyRate_bin3 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ HourlyRate_bin4 <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ JobInvolvement_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobInvolvement_Medium <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1,...
## $ JobInvolvement_High <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ JobInvolvement_Very.High <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ JobLevel_X1 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0,...
## $ JobLevel_X2 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ JobLevel_X3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ JobLevel_X4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobLevel_X5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Healthcare.Representative <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Human.Resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Laboratory.Technician <dbl> 0, 0, 1, 0, 1, 1, 1, 1, 0,...
## $ JobRole_Manager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Manufacturing.Director <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ JobRole_Research.Director <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Research.Scientist <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ JobRole_Sales.Executive <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Sales.Representative <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobSatisfaction_Low <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ JobSatisfaction_Medium <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ JobSatisfaction_High <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 1,...
## $ JobSatisfaction_Very.High <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ MaritalStatus_Single <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 1,...
## $ MaritalStatus_Married <dbl> 0, 1, 0, 1, 1, 0, 1, 0, 0,...
## $ MaritalStatus_Divorced <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ MonthlyIncome_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyIncome_bin1 <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ MonthlyIncome_bin2 <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0,...
## $ MonthlyIncome_bin3 <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyIncome_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ MonthlyRate_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyRate_bin1 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ MonthlyRate_bin2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ MonthlyRate_bin3 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ MonthlyRate_bin4 <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin1 <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1,...
## $ NumCompaniesWorked_bin2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin3 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ NumCompaniesWorked_bin4 <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ OverTime_No <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 1,...
## $ OverTime_Yes <dbl> 1, 0, 1, 1, 0, 0, 1, 0, 0,...
## $ PercentSalaryHike_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin1 <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin2 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ PercentSalaryHike_bin3 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin4 <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 1,...
## $ PerformanceRating_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PerformanceRating_Good <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PerformanceRating_Excellent <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ PerformanceRating_Outstanding <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 1,...
## $ RelationshipSatisfaction_Low <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ RelationshipSatisfaction_Medium <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ RelationshipSatisfaction_High <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0,...
## $ RelationshipSatisfaction_Very.High <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ StockOptionLevel_X0 <dbl> 1, 0, 1, 1, 0, 1, 0, 0, 1,...
## $ StockOptionLevel_X1 <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0,...
## $ StockOptionLevel_X2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ StockOptionLevel_X3 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ TotalWorkingYears_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TotalWorkingYears_bin1 <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ TotalWorkingYears_bin2 <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1,...
## $ TotalWorkingYears_bin3 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ TotalWorkingYears_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TrainingTimesLastYear_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TrainingTimesLastYear_bin1 <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 1,...
## $ TrainingTimesLastYear_bin2 <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 0,...
## $ TrainingTimesLastYear_bin3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ WorkLifeBalance_Bad <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ WorkLifeBalance_Good <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,...
## $ WorkLifeBalance_Better <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 1,...
## $ WorkLifeBalance_Best <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin1 <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ YearsAtCompany_bin2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin3 <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ YearsAtCompany_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin1 <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ YearsInCurrentRole_bin2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin3 <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ YearsInCurrentRole_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin1 <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 1,...
## $ YearsSinceLastPromotion_bin2 <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin1 <dbl> 0, 0, 1, 1, 1, 0, 1, 1, 0,...
## $ YearsWithCurrManager_bin2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin3 <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ YearsWithCurrManager_bin4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Attrition_No <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1,...
## $ Attrition_Yes <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0,...
tidy() is a context-specific function that converts non-data-frame objects to data frames (tibbles). From broom package.
recipes has a special implementation of tidy(). Below provides the steps that were taken:
## # A tibble: 4 x 5
## number operation type trained skip
## <int> <chr> <chr> <lgl> <lgl>
## 1 1 step zv TRUE FALSE
## 2 2 step num2factor TRUE FALSE
## 3 3 step discretize TRUE FALSE
## 4 4 step dummy TRUE FALSE
It is easy to return the details of the operations in the recipe hierarchy provided by tidy(recipe_obj). This is very helpful to understand the binning strategy the recipe function provided.
## # A tibble: 73 x 2
## terms value
## <chr> <dbl>
## 1 Age -Inf
## 2 Age 30
## 3 Age 36
## 4 Age 43
## 5 Age Inf
## 6 DailyRate -Inf
## 7 DailyRate 464
## 8 DailyRate 798
## 9 DailyRate 1156
## 10 DailyRate Inf
## # ... with 63 more rows
as_factor converts in the order the data appears in the dataset.
The goal is to get the Attrition_Yes correlation information from train_corr_tbl. First recall what the function get_cor does. Review tidy eval if needed.
## function (data, target, use = "pairwise.complete.obs", fct_reorder = FALSE,
## fct_rev = FALSE)
## {
## feature_expr <- enquo(target)
## feature_name <- quo_name(feature_expr)
## data_cor <- data %>% mutate_if(is.character, as.factor) %>%
## mutate_if(is.factor, as.numeric) %>% cor(use = use) %>%
## as.tibble() %>% mutate(feature = names(.)) %>% select(feature,
## !(!feature_expr)) %>% filter(!(feature == feature_name)) %>%
## mutate_if(is.character, as_factor)
## if (fct_reorder) {
## data_cor <- data_cor %>% mutate(feature = fct_reorder(feature,
## !(!feature_expr))) %>% arrange(feature)
## }
## if (fct_rev) {
## data_cor <- data_cor %>% mutate(feature = fct_rev(feature)) %>%
## arrange(feature)
## }
## return(data_cor)
## }
Attrition_No is not needed - only interested in employees that leave. After get_cor, the table looks like this:
## Warning in stats::cor(x, ...): the standard deviation is zero
## # A tibble: 6 x 2
## feature Attrition_Yes
## <fct> <dbl>
## 1 OverTime_No -0.240
## 2 StockOptionLevel_X1 -0.161
## 3 JobLevel_X2 -0.134
## 4 Department_Research...Development -0.105
## 5 YearsInCurrentRole_bin4 -0.102
## 6 TotalWorkingYears_bin4 -0.0994
relationship: turns the factors in feature into a new column of type character so it can be manipulated easily.
tidyr::separate splits columns of character data into multiple columns using a separator.
extra: if sep is a character vector, this controls what happens when there are too many pieces. There are three valid options: "warn", "drop" and "merge".
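A minimal, standalone sketch of separate() with extra = "drop" (hypothetical feature strings): only the text before the first underscore is kept.

```r
library(tibble)
library(tidyr)

tibble(feature = c("OverTime_No", "StockOptionLevel_X1", "YearsInCurrentRole_bin4")) %>%
  # Keep the first piece, silently drop the rest
  separate(feature, into = "feature_base", sep = "_", extra = "drop")
# feature_base: "OverTime", "StockOptionLevel", "YearsInCurrentRole"
```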
mutate(feature_base = as_factor(feature_base) %>% fct_rev())
as_factor assigns factor levels in the order they appear. Therefore, OverTime would be assigned 1, JobLevel would be assigned 2, and so on. fct_rev() changes this so the OverTime level would become the highest level number, JobLevel one less, and so on.
corr_level <- 0.06
correlation_results_tbl <- train_corr_tbl %>% select(-Attrition_No) %>%
get_cor(Attrition_Yes, fct_reorder = T, fct_rev = T) %>%
filter(abs(Attrition_Yes) >= corr_level) %>%
mutate(relationship = case_when(Attrition_Yes > 0 ~ "Supports", TRUE ~ "Contradicts")) %>%
mutate(feature_text = as.character(feature)) %>%
separate(feature_text, into = "feature_base", sep = "_", extra = "drop") %>%
mutate(feature_base = as_factor(feature_base) %>% fct_rev())
length_unique_groups <- correlation_results_tbl %>% pull(feature_base) %>% unique() %>% length()
correlation_results_tbl %>%
ggplot(aes(Attrition_Yes, feature_base, color = relationship)) + geom_point() +
geom_label(aes(label = feature), vjust = -0.5) +
expand_limits(x = c(-0.3, 0.3),y = c(1, length_unique_groups + 1)) + theme_tq() +
scale_color_tq() + labs(title = "Correlation Analysis: Recommendation Strategy Development",
subtitle = "Discretizing features to help identify a strategy")
The worksheet takes the information from the correlation analysis to develop strategies to implement a recommendation engine. The intent is to provide managers targeted strategies based on high risk factors.
Note in the worksheet the strategy group options provide a way to group features that appear to have affinity with one another. These groups will be used in the Shiny App to be developed later.
Skip is selected under Potential Strategy because some features are not open for change/manipulation.
Strategy Group assignment is performed after the other columns are completed
This is where data scientists are important. They apply critical thinking to develop a coded algorithm combining data (data-driven) and knowledge of a system (business-driven). This recommendation algorithm becomes valuable to the business because once deployed, it drives decision making from process owners (in this case Managers)
The information below is copied from the Excel Spreadsheet
Using the information from the spreadsheet, develop logic that can be turned into code.
Good-Better-Best is just one type of approach.
Alternatively, consider for Personal Development, YearsAtCompany has the greatest separation in the Correlation Analysis Plot. Wouldn’t its position on the plot suggest it is the feature that needs the most attention and therefore the worst case? (I get that there is little you can really do as a manager for YAC. Just an example to try to understand the Best/Worst Case logic.) Response: This is a reasonable assessment. In reality, you’d come up with your logic, which may be imperfect. You’d present this to your management & stakeholders. They would recommend changes as necessary to tweak it to their experience and the data, just as you are doing with me. This is the iterative process and communication. The goal is to get a story to tell, and they will suggest changes. This is good - because when you incorporate their changes, you gain buy-in!
Worst Case - Create Personal Development Plan: Job Satisfaction, Job Involvement, ADDED PerformanceRating if low (because it is a logical thing to do)
Best Case 2 - Seek Leadership Role: JobInvolvement, JobSatisfaction, ADDED PerformanceRating. Leaders tend to be very involved and have good performance metrics. Want to identify leaders early. The main difference between mentors and leaders is that leaders do not have a time component. Leaders have high JobInvolvement, which is a measure of engagement.
Using the information from the above strategy, code the logic into R code.
Before coding, understand how to discover bins were divided in the recipe. This is easy to do using tidy again. Here is an example using YearsAtCompany:
## # A tibble: 4 x 5
## number operation type trained skip
## <int> <chr> <chr> <lgl> <lgl>
## 1 1 step zv TRUE FALSE
## 2 2 step num2factor TRUE FALSE
## 3 3 step discretize TRUE FALSE
## 4 4 step dummy TRUE FALSE
Note the discretize type above. This tells us the binning occurs in step 3.
## # A tibble: 73 x 2
## terms value
## <chr> <dbl>
## 1 Age -Inf
## 2 Age 30
## 3 Age 36
## 4 Age 43
## 5 Age Inf
## 6 DailyRate -Inf
## 7 DailyRate 464
## 8 DailyRate 798
## 9 DailyRate 1156
## 10 DailyRate Inf
## # ... with 63 more rows
This provides all the discretized steps. In this example, we are only interested in YearsAtCompany:{#str_detect}
## # A tibble: 5 x 2
## terms value
## <chr> <dbl>
## 1 YearsAtCompany -Inf
## 2 YearsAtCompany 3
## 3 YearsAtCompany 5
## 4 YearsAtCompany 10
## 5 YearsAtCompany Inf
This defines the bins: Bin1 < 3, Bin2 = 3 - 4, Bin3 = 5 - 9, Bin4 > 9. Knowing the definition of the bins is needed below. In practice, perform this lookup for each binned variable.
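To sanity-check that reading of the cutpoints, here is a sketch using base::cut() with the YearsAtCompany cutpoints returned by tidy(). recipes::step_discretize does its own binning internally, so treat this as illustrative only:

```r
# Left-closed intervals: [-Inf, 3), [3, 5), [5, 10), [10, Inf)
cut(c(0, 3, 5, 10, 20),
    breaks = c(-Inf, 3, 5, 10, Inf),
    right = FALSE,
    labels = c("bin1", "bin2", "bin3", "bin4"))
# bin1 bin2 bin3 bin4 bin4
```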
tidy(recipe_obj, number = 3) %>% filter(str_detect(terms, "TotalWorkingYears"))
tidy(recipe_obj, number = 3) %>% filter(str_detect(terms, "YearsInCurrent"))
tidy(recipe_obj, number = 3) %>% filter(str_detect(terms, "YearsAtCompany"))

It is also important to know the factor levels:
# Relate factor label to underlying numeric value
train_readable_tbl %>% pull(JobInvolvement) %>% levels()

## [1] "Low" "Medium" "High" "Very High"
train_readable_tbl %>% pull(OverTime) %>% levels()
train_readable_tbl %>% pull(WorkLifeBalance) %>% levels()
train_readable_tbl %>% pull(BusinessTravel) %>% levels()

The next step is easy to do. Create an outline that provides sections for Worst Case, Better Case and Best Case. Then collect the measures that are part of each of these sections. Coding then becomes pretty easy: just leverage case_when with the logic that has been defined.
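A quick base-R sketch of what mutate_if(is.factor, as.numeric) does in the code that follows: each label is replaced by its level position, which is why the thresholds compare against small integers. The levels here mirror JobInvolvement above; the vector of values is a toy example.

```r
# as.numeric() on a factor returns the level position, not the label text
ji <- factor(c("High", "Medium", "Very High", "Low"),
             levels = c("Low", "Medium", "High", "Very High"))
as.numeric(ji)   # 3 2 4 1
```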
To illustrate, the worst case is presented below, finished with extra code to see the results. This is important as you translate logic into code. While the logic may seem sound at first, if the resulting spread does not make intuitive sense, the assigned metrics may need to be tweaked so the results are balanced in such a way that action can be taken effectively.
train_readable_tbl %>%
select(YearsAtCompany, TotalWorkingYears, YearsInCurrentRole, JobInvolvement, JobSatisfaction, PerformanceRating) %>%
mutate_if(is.factor, as.numeric) %>%
mutate(personal_development_strategy = case_when(
#Worst Case
PerformanceRating == 1 | JobSatisfaction == 1 |
JobInvolvement <= 2 ~ "Create Personal Development Plan",
TRUE ~ "Retain and Maintain")) %>% pull(personal_development_strategy) %>% table()

## .
## Create Personal Development Plan Retain and Maintain
## 553 697
About 1/3 of the employees are being targeted to Create Personal Development Plan. That seems a bit much, so one option is to change the JobInvolvement threshold from 2 to 1. This would reduce the count to a number that appears more reasonable, but the data shows that JobInvolvement is an important measure, so leave it at 2 for now.
The important thing to do is to always evaluate the outcome from a business perspective at each step.
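The sensitivity of a threshold choice can be sketched with a toy vector (ji stands in for the training data's JobInvolvement after numeric conversion; the values are made up for illustration):

```r
# Toy JobInvolvement values; compare how many employees each rule would flag
ji <- c(1, 2, 2, 3, 3, 3, 4, 4)
sum(ji <= 2)   # current rule: flags 3 of 8
sum(ji == 1)   # tightened rule: flags 1 of 8
```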
case_when logic to reduce if appropriate (~308 targeted if JobInvolvement == 1):

train_readable_tbl %>%
select(YearsAtCompany, TotalWorkingYears, YearsInCurrentRole, JobInvolvement, JobSatisfaction, PerformanceRating) %>%
mutate_if(is.factor, as.numeric) %>%
mutate(personal_development_strategy = case_when(
#Worst Case
PerformanceRating == 1 | JobSatisfaction == 1 |
JobInvolvement <= 2 ~ "Create Personal Development Plan",
#Better Case
YearsAtCompany < 3 | TotalWorkingYears < 6 ~ "Promote Training and Formation",
#Best Case 1: Seek mentorship role
(YearsInCurrentRole > 3 | YearsAtCompany >= 5) & PerformanceRating >=3 & JobSatisfaction == 4
~ "Seek Mentorship Role",
#Best Case 2: Seek Leadership Role:
JobSatisfaction >= 3 & PerformanceRating >= 3 ~ "Seek Leadership Role",
# Catch All
TRUE ~ "Retain and Maintain")) %>% pull(personal_development_strategy) %>% table()

## .
## Create Personal Development Plan Promote Training and Formation
## 553 216
## Retain and Maintain Seek Leadership Role
## 111 195
## Seek Mentorship Role
## 175
Progressive Targeting: The logic-based system uses an if-style evaluation, progressively isolating the most concerning groups down to the least concerning.
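First-match-wins evaluation is the key property of case_when: rules are checked top-down and the first TRUE condition assigns the strategy. A minimal base-R sketch (classify is a hypothetical stand-in for the chain above, reduced to three rules):

```r
# Rules are checked top-down; an employee matching both the worst case and
# the training rule is captured by the worst case, never the training bucket.
classify <- function(PerformanceRating, JobSatisfaction, JobInvolvement,
                     YearsAtCompany, TotalWorkingYears) {
  if (PerformanceRating == 1 || JobSatisfaction == 1 || JobInvolvement <= 2)
    return("Create Personal Development Plan")   # worst case, checked first
  if (YearsAtCompany < 3 || TotalWorkingYears < 6)
    return("Promote Training and Formation")
  "Retain and Maintain"                          # catch-all
}
classify(PerformanceRating = 3, JobSatisfaction = 1, JobInvolvement = 3,
         YearsAtCompany = 1, TotalWorkingYears = 10)
```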
Glimpse at the results:
## Observations: 1,250
## Variables: 7
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5...
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17...
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4,...
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4,...
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2,...
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3,...
## $ personal_development_strategy <chr> "Seek Mentorship Role", "Create ...
The goal is not perfection. This will develop over iterations and decision-maker feedback before and during deployment.
The professional development strategy is focused on promotion, rotation and specialization in the job. Rotation keeps people fresh and excited and helps them understand multiple aspects of the business. Specialization focuses on becoming the best specialist, often for highly technical people who love their job (high satisfaction) but are not interested in leadership.
Also use the Correlation Analysis Plot for guidance.
Progressively make it more difficult to become promoted. This is a good thing because it incentivizes the performance attributes that are important to the business. These key characteristics should be communicated to employees during performance reviews.
JobLevel - Higher performers in entry level positions (Level 1) are leaving. Job Level 2 are staying.
- Promote faster for high performing employees
YearsAtCompany (YAC) - High YAC: likely to stay / Low YAC: likely to leave - If YAC is low, tie to promotion to advance faster / Mentor if YAC is low
YearsInCurrentRole (YiCR)
- More time in current role related to lower attrition
- Incentivize specialization or promote - Preferred as the main driver over YAC
Additional Features - JobInvolvement - important for promotion readiness, incentivizes involvement for leaders and early promotion - JobSatisfaction - Important for specialization, incentivizes satisfaction for mentors - PerformanceRating - Important for any promotion
Good Better Best Approach - Typical Career Paths: 1 - Leadership (Management); 2 - Specialist (Technical) - Both should have promotion schedules based on clear metrics
Ready for Rotation: YearsInCurrentRole, JobInvolvement, PerformanceRating (LOW)
Ready for Promotion Level 2: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 3: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 4: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 5: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Incentivize Specialization: YearsInCurrentRole, JobInvolvement, PerformanceRating
train_readable_tbl %>%
select(JobLevel, YearsInCurrentRole, JobInvolvement, JobSatisfaction, PerformanceRating) %>%
mutate_if(is.factor, as.numeric) %>%
mutate(
professional_development_strategy = case_when(
# Ready for Rotation: YearsInCurrentRole, JobSatisfaction (LOW)
YearsInCurrentRole >= 2 & JobSatisfaction <=2 ~ "Ready for Rotation",
# Ready for Promotion Level 2: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
JobLevel == 1 & YearsInCurrentRole >= 2 & JobInvolvement >= 3 &
PerformanceRating >= 3 ~ "Ready for Promotion",
# Ready for Promotion Level 3: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
JobLevel == 2 & YearsInCurrentRole >= 2 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
# Ready for Promotion Level 4: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
JobLevel == 3 & YearsInCurrentRole >= 3 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
# Only a few added at this point but that is probably OK, pretty high up in the company
# Ready for Promotion Level 5: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
JobLevel == 4 & YearsInCurrentRole >= 4 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
# Only a few added at this point but that is probably OK, pretty high up in the company
# Incentivize Specialization: YearsInCurrentRole, JobSatisfaction, PerformanceRating
YearsInCurrentRole >= 4 & JobSatisfaction >= 4 &
PerformanceRating >= 3 ~ "Incentivizes Specialization",
# Catch All
TRUE ~ "Retain and Maintain")) %>% pull(professional_development_strategy) %>% table()

## .
## Incentivizes Specialization Ready for Promotion
## 138 182
## Ready for Rotation Retain and Maintain
## 384 546
Good Better Best Approach
train_readable_tbl %>%
select(OverTime, EnvironmentSatisfaction, WorkLifeBalance, BusinessTravel, DistanceFromHome, YearsInCurrentRole, JobInvolvement) %>%
mutate_if(is.factor, as.numeric) %>%
mutate(
work_environment_strategy = case_when(
# Improve work life: Overtime, WorklifeBalance
OverTime == 2 | WorkLifeBalance == 1 ~ "Improve Work-Life Balance",
#Recall Overtime 2 == YES: train_readable_tbl %>% pull(OverTime) %>% levels()
# Monitor Business Travel: BusinessTravel, DistanceFromHome, WorkLifeBalance
(BusinessTravel == 3 | DistanceFromHome >= 10) & WorkLifeBalance ==2 ~ "Monitor Business Travel",
# WorkLifeBalance = 2 to differentiate from the case_when above
# Review Job Assignment: EnvironmentSatisfaction, YearsInCurrentRole
EnvironmentSatisfaction == 1 & YearsInCurrentRole >=2 ~ "Review Job Assignment",
# Give the new employee enough time to adjust therefore the 2 years
# Promote Job Engagement: JobInvolvement - just a rational addition
JobInvolvement <=2 ~ "Promote Job Involvement",
TRUE ~ "Retain and Maintain")) %>% count(work_environment_strategy)

## # A tibble: 5 x 2
## work_environment_strategy n
## <chr> <int>
## 1 Improve Work-Life Balance 407
## 2 Monitor Business Travel 105
## 3 Promote Job Involvement 182
## 4 Retain and Maintain 423
## 5 Review Job Assignment 133
Review all the work performed so far:
## Observations: 1,250
## Variables: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <fct> College, Below College, College, Mast...
## $ EducationField <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction <fct> Medium, High, Very High, Very High, L...
## $ Gender <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
# Sample arguments for interactively testing the function body
data <- train_readable_tbl
employee_number <- 19
recommend_function <- function(data, employee_number){
data %>% filter(EmployeeNumber == employee_number) %>% mutate_if(is.factor, as.numeric) %>%
#Personal Development Strategy
mutate(
personal_development_strategy = case_when(
PerformanceRating == 1 | JobSatisfaction == 1 |
JobInvolvement <= 2 ~ "Create Personal Development Plan",
YearsAtCompany < 3 | TotalWorkingYears < 6
~ "Promote Training and Formation",
(YearsInCurrentRole > 3 | YearsAtCompany >= 5) & PerformanceRating >=3 &
JobSatisfaction ==4 ~ "Seek Mentorship Role",
JobInvolvement >= 3 & JobSatisfaction >= 3 &
PerformanceRating >= 3 ~ "Seek Leadership Role", TRUE ~ "Retain and Maintain")) %>%
# Professional Development Strategy
mutate(
professional_development_strategy = case_when(
YearsInCurrentRole >= 2 & JobSatisfaction <=2 ~ "Ready for Rotation",
JobLevel == 1 & YearsInCurrentRole >= 2 & JobInvolvement >= 3 &
PerformanceRating >= 3 ~ "Ready for Promotion",
JobLevel == 2 & YearsInCurrentRole >= 2 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
JobLevel == 3 & YearsInCurrentRole >= 3 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
JobLevel == 4 & YearsInCurrentRole >= 4 & JobInvolvement >= 4 &
PerformanceRating >= 3 ~ "Ready for Promotion",
YearsInCurrentRole >= 4 & JobSatisfaction >= 4 &
PerformanceRating >= 3 ~ "Incentivizes Specialization",
TRUE ~ "Retain and Maintain")) %>%
# Work Environment Strategy
mutate(
work_environment_strategy = case_when(
OverTime ==2 | WorkLifeBalance ==1 ~ "Improve Work-Life Balance",
(BusinessTravel ==3 | DistanceFromHome >= 10) &
WorkLifeBalance ==2 ~ "Monitor Business Travel",
# WorkLifeBalance = 2 to differentiate from the case_when above
EnvironmentSatisfaction ==1 & YearsInCurrentRole >=2 ~ "Review Job Assignment",
JobInvolvement <=2 ~ "Promote Job Involvement", TRUE ~ "Retain and Maintain"))%>%
select(EmployeeNumber, personal_development_strategy,
professional_development_strategy, work_environment_strategy)
}

Test the recommendation algorithm.
## # A tibble: 1,250 x 1
## EmployeeNumber
## <dbl>
## 1 1
## 2 2
## 3 4
## 4 5
## 5 7
## 6 8
## 7 10
## 8 11
## 9 12
## 10 13
## # ... with 1,240 more rows
## # A tibble: 1 x 4
## EmployeeNumber personal_developme~ professional_devel~ work_environment~
## <dbl> <chr> <chr> <chr>
## 1 1 Seek Mentorship Ro~ Incentivizes Speci~ Improve Work-Lif~
## # A tibble: 1 x 4
## EmployeeNumber personal_developme~ professional_devel~ work_environment~
## <dbl> <chr> <chr> <chr>
## 1 2 Create Personal De~ Ready for Rotation Promote Job Invo~
## # A tibble: 1 x 4
## EmployeeNumber personal_developme~ professional_devel~ work_environment~
## <dbl> <chr> <chr> <chr>
## 1 4 Create Personal De~ Retain and Maintain Improve Work-Lif~
## # A tibble: 0 x 4
## # ... with 4 variables: EmployeeNumber <dbl>,
## # personal_development_strategy <chr>,
## # professional_development_strategy <chr>,
## # work_environment_strategy <chr>
## # A tibble: 220 x 1
## EmployeeNumber
## <dbl>
## 1 228
## 2 1278
## 3 1250
## 4 2065
## 5 1767
## 6 1308
## 7 18
## 8 460
## 9 1369
## 10 1040
## # ... with 210 more rows
## # A tibble: 1 x 4
## EmployeeNumber personal_developme~ professional_devel~ work_environment~
## <dbl> <chr> <chr> <chr>
## 1 1767 Create Personal De~ Retain and Maintain Improve Work-Lif~
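The single-employee calls above extend naturally to the whole table. A self-contained sketch with toy stand-ins (toy_recommend and toy_tbl are hypothetical simplifications; with the real objects you would map recommend_function over train_readable_tbl$EmployeeNumber, e.g. with purrr::map_dfr):

```r
# Toy stand-ins for train_readable_tbl and recommend_function above
toy_tbl <- data.frame(EmployeeNumber = c(1, 2, 4),
                      JobInvolvement = c(3, 2, 2))
toy_recommend <- function(data, employee_number) {
  row <- data[data$EmployeeNumber == employee_number, ]
  strategy <- ifelse(row$JobInvolvement <= 2,
                     "Promote Job Involvement", "Retain and Maintain")
  data.frame(EmployeeNumber = employee_number,
             work_environment_strategy = strategy)
}
# One result row per employee, row-bound (base-R analogue of purrr::map_dfr)
results <- do.call(rbind, lapply(toy_tbl$EmployeeNumber,
                                 function(e) toy_recommend(toy_tbl, e)))
results
```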